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1.0  INTRODUCTION 


This  document  reports  the  architecture  study  results  for  the  real-time  implementation  of 
remote  minefield  detection  system  (REMIDS)  algorithms  using  images  from  the  REMIDS  II 
sensor.  The  real-time  REMIDS  system  consists  of  two  intercommunicating  processors.  The 
processes  that  they  perform  are  referred  to  as  data  compression  and  classification. 

Sections  2.0  and  3.0  and  their  related  subsections  discuss  the  data  compression  and 
classification  algorithms  of  the  REMIDS  II  System. 

Sections  2.1  and  3.1  discuss  the  data  compression  and  classification  algorithms.  Sec¬ 
tions  2.2  and  3.2  discuss  the  performance  of  the  processors  that  were  considered  for  perform¬ 
ing  the  data  compression  and  classification  algorithms. 

Sections  2.3,  2.4,  3.3  and  3.4  determine  a  quantitative  score  for  each  of  the  data  compres¬ 
sion  and  classification  processors.  These  scores  are  determined  through  a  two-part  procedure 
(explained  m  Sertion  1.2)  ^nd  are  to  evaluate  the  processors  executing  the  data  com¬ 
pression  and  classification  algorithms. 

Section  4.0  discusses  the  proposed  hardware  configuration  for  the  REMIDS  II  system  and 
lists  the  unresolved  issues. 

Throughout  this  report  the  following  terms  are  used.  Architecture  refers  to  the  organi¬ 
zational  structure  of  a  processor.  Processor  refers  to  an  implementation  of  an  architecture. 
More  than  one  implementation  of  an  architecture  may  exist.  System  refers  to  an  assembly 
of  processors  that  form  a  whole — in  this  case,  the  REMIDS  system.  Operation  refers  to  the 
actions  performed  when  an  instruction  is  executed.  Examples  are  the  addition  and  multipli¬ 
cation  operations  of  scalar  arithmetic;  the  convolution,  thresholding  and  feature  extraction 
operations  of  image  analysis;  and  the  dilation  and  erosion  operations  of  morphological  image 
processing.  Algorithm  refers  to  a  set  of  operations  and  rules  for  getting  a  specific  output  from 
a  specific  input.  Process  refers  to  an  algoriiiim  that  changes  the  general  form  of  data  in  a  de¬ 
finable  way.  Consequently,  data  compression,  clustering,  and  linearity  and  density  screening 
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are  processes  and  algorithms  since  they  have  image,  vector,  scalar,  and  morphological  data 
types.  Architecture  overhead  refers  to  the  processing  power  required  for  operations  that  do 
not  contribute  to  the  processing  of  algorithm  data,  such  as  I/O  management,  processor  syn¬ 
chronization,  etc.  Architecture  overhead  also  includes  processing  power  that  is  wasted  when 
part  of  the  hardwzire  is  unused  because  the  hardware  is  not  perfectly  fitted  to  the  algorithm. 
For  instance,  during  synchronization  of  multiple  parallel  processors,  some  processors  will 
idle — contributing  nothing  to  performance — while  waiting  for  other  processors  to  complete 
their  operations.  Architecture  head  room  refers  to  the  available  safety  margin  of  processing 
power.  Although  some  processing  power  may  be  unused,  it  is  not  included  as  part  of  the 
architecture  head  room  if  it  is  not  able  to  be  used  {for  example,  idle  time  during  processor 
synchronization). 

The  following  sections  provide  background  material  on  the  sensor  and  the  algorithms 
for  mine  detection,  describe  the  methodology  used  to  evaluate  the  data  compression  and 
classification  processors,  and  summarize  the  results  of  the  study. 

1.1  Background 

The  remote  minefield  detection  system  (REMIDS)  developed  by  the  Waterways  Experimen¬ 
tation  Station  (WES)  uses  a  sensor,  which  includes  active  and  passive  modalities,  to  form 
images  for  remote  sensing  of  anti-personnel  and  anti-armor  mines.  The  sensor  includes  a 
thermal  infrared  (IR)  detector  and  two  (parallel  and  cross)  polarization  sensitive  near  in¬ 
frared  (NIR)  detectors  with  an  active  polarized  IR  loser  illuminator.  The  sensor  is  intended 
to  be  mounted  in  an  autonomous  or  remotely  piloted  air  vehicle  that  will  traverse  navigable 
land  arecLS  to  locate  and  identify  the  boundaries  of  minefields. 

The  algorithms  use  an  adaptive  multi-sensor  fusion  approach  to  finding  the  minefields 
(see  Figure  1,  Minefield  Detection  Algorithm).  The  polarization  information  is  formed  into 
two  iinages;  the  difforence  between  the  parallel  and  cross  polarization  returns,  and  the  total 
of  the  these  two  returns.  These,  plus  the  thermal  image,  make  up  the  three  image  inputs  to 
the  minefield  detection  algorithm. 
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The  algorithms  can  be  divided  into  two  major  sections.  The  top  half  of  Figure  1  (the  part 
above  the  classification  block)  shows  an  adaptive  data  compression  algorithm  that,  with  slight 
variations,  is  executed  on  each  of  the  three  image  types:  polarized  NIR  detectors’  difference, 
polarized  NIR  detectors’  total,  and  IR  thermal  detector.  This  algorithm  determines  those 
pixels  that  are  likely  to  be  part  of  a  mine.  Pixels  are  chosen  based  on  an  analysis  of  edges  in 
the  region  of  the  pixel.  The  lower  half  of  Figure  1  depicts  an  adaptive  classification  algorithm 
that  categorizes  the  pixels  screened  in  the  data  compression  cdgorithm  into  individual  clusters 
and  then  regroups  them  into  objects.  These  objects  cam  then  be  analyzed  for  shape,  size,  and 
spatial  distributions  to  determine  if  they  are  part  of  a  minefield.  The  classification  algorithm 
is  explained  in  detail  in  Section  3.1. 

1.2  Evaluation  Methodology 

Alternative  synthesis  and  evaluation  has  been  constrained,  primarily,  by  time  and  funding.  In 
order  to  provide  the  most  thorough  analysis  practical  in  the  time  available,  several  individuals 
were  assigned  to  evciluate  different  architectures  for  the  data  compression  and  classification 
processors.  To  provide  consistent  evaluation  of  the  architectures  and  to  limit  the  impact  of 
biases,  a  consistent  procedure  of  evaluation  is  used.  The  procedure  of  choosing  processors 
hcis  two  major  steps.  These  steps  are  applied  to  evaluate  the  data  compression  processors 
in  Sections  2.3  and  2.4,  and  the  classification  processors  in  Sections  3.3  and  3.4. 

In  the  first  step,  a  set  of  decision  criteria  and  priorities  are  established.  The  decision 
criteria  are  divided  into  a  mandatory  and  a  desirable  set.  The  mandatory  criteria  are  essential 
for  success  of  the  real-time  project.  The  desirable  criteria  are  assigned  a  weight,  on  a  scale 
of  one  to  ten,  that  reflects  their  importance  for  successful  completion  of  the  program.  The 
most  important  desirable  criteria  receive  a  weight  of  ten.  Other  desirable  criteria  are  given 
weights  in  proportion  to  the  most  important  desirable  criteria. 

The  second  step  is  to  assign  a  score  to  each  processor  for  each  desirable  criteria,  on  a 
scale  ot  one  to  ten,  that  reflects  how  well  each  processor  satisfies  each  of  the  desirable  criteria; 
each  processor  is  also  studied  to  determine  if  each  mandatory  criteria  has  been  satisfied.  If 
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a  particular  architecture  does  not  satisfy  a  mandatory  criteria,  then  it  is  eliminated  from 
further  consideration.  The  processor  that  best  satisfies  the  desirable  criteria  receives  a  score 
of  10.  Other  processors  are  given  scores  proportional  to  a  score  of  10.  Whenever  possible, 
the  scores  cire  based  on  quantitative  information.  Each  processor  then  receives  a  composite 
score — computed  by  summing  the  products  of  the  criteria  weights  and  the  processor  scores 
for  each  criteria — that  is  used  to  justify  the  choice  of  processor  for  the  REMIDS  system. 

1.3  Summary  of  Results 

Three  alternatives  were  evaluated  for  real-time  execution  of  the  Data  Compression  algo¬ 
rithms,  the  AISI  AISI5000,  the  ITI  Series  150  board  set,  and  the  Datacube  Max-Video  board 
set.  It  was  concluded  that  the  Datacube  board  set  provides  the  best  balance  of  cost,  per¬ 
formance,  and  support  for  this  algorithm.  The  alternatives  either  had  significant  risk  of  not 
being  able  to  execute  the  algorithms  in  real  time  or  required  development  of  special  inter 
faces  to  handle  the  line  lengths  from  the  REMIDS  II  sensor  and  the  required  high  resolution 
display.  The  Datacube  board  set  provides  the  required  interfaces,  modules  to  execute  each 
of  the  sections  of  the  algorithm  in  real  time,  and  reasonable  software  and  field  support. 

Five  alternatives  were  considered  for  real-time  execution  of  the  Cleissification  Processor 
algorithms.  These  were  the  AMT  DAP500,  the  N-Cube  N-Cube7,  the  Parsytec  Transputer 
boards,  the  AISI  AISI5000,  and  multiple  68020  boards  from  Force.  It  was  concluded  that  the 
Parsytec  Transputer  boards  provide  the  best  balance  of  cost,  performance,  and  support  for 
these  algorithms.  It  is  anticipated  that  13  Transputers  will  provide  the  required  processing 
power.  The  system  is  expandable  with  boards  that  contain  two  transputers  each.  The 
support  software  is  well  established  and  the  cost  is  well  below  any  of  the  alternatives. 

Section  2.0  e  xplains  in  greater  detail  why  the  Datacube  board  set  was  chosen  for  integrat¬ 
ing  and  executing  the  Data  Compression  algorithm.  Section  3.0  explains  why  the  Parsytec 
Transputer  board  set  wjls  chosen  for  integrating  and  executing  the  Classification  algorithm. 
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2.0  DATA  COMPRESSION  PROCESSOR 


The  data  compression  algorithms  require  extensive  processing  for  each  pixel  acquired  by  the 
sensor.  Two  approaches  exist  for  implementing  the  algorithms  in  real  time.  Since  the  algo¬ 
rithms  were  initially  implemented  for  non-real-time  execution,  we  assumed  a  frame-by-frame 
approach.  An  alternative,  which  must  be  considered,  is  to  implement  the  algorithms  in  a 
line-by-line  approach.  The  principal  difference  between  these  approaches  is  in  the  adaptive 
thresholding  cispects  of  the  algorithm.  The  line-by-line  approach  provides  a  smoother  tran¬ 
sition  when  the  characteristics  of  the  data  change,  but  it  will  also  require  more  computation. 
For  this  evaluation,  it  is  assumed  that  the  existing  frame-by-frame  approach  will  be  fol¬ 
lowed.  However,  one  of  the  desirable  decision  criteria  is  the  availability  of  enough  flexibility 
and  computational  power  to  implement  a  line-by-line  approach. 

An  overview  of  the  data  compression  algorithm  and  hardware  performance  analyses  can 
be  found  in  Sections  2.1  and  2.2.  The  acceptance  criteria  and  evaluation  results  correspond¬ 
ing  to  the  two  steps  of  the  evaluation  process,  explained  in  Section  1.2,  are  contained  'n 
Sections  2.3  and  2.4. 

2.1  Data  Compression  Algorithm  Analysis 

Execution  of  the  data  compression  algorithms  can  be  performed  by  the  generic  processor 
depicted  in  Figure  2,  Generic  Data  Compression  Processor.  Three  image  channels — sum. 
difference,  and  thermal — are  shown  passing  through  the  processor  in  parallel.  The  hardware 
resources  may  be  shared  across  multiple  channels  and  operate  upon  each  channel  sequentially. 
The  classes  of  operations  required  by  the  data  compression  algorithms  can  be  categorized  as 
follows: 

•  Input  Data  Formatting 

•  Image  Buffering 

•  Histogramming  and  Image  Statistic  Computation 

•  Convolution 

•  Thresholding 

•  Feature  Extraction 

•  Image  Display 
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Figure  2.  Generic  Data  Compression  Processor 


These  operations  form  a  required  minimum  set  of  functions  that  any  potential  data  com¬ 
pression  processor  must  support. 

2.2  Data  Compression  Hardware  Performance  Analysis 

There  axe  two  general  classes  of  processing  architectures  that  can  provide  enough  compu¬ 
tational  power  to  execute  the  data  compression  algorithms  in  real  time;  massively  parallel 
and  pipelined.  These  two  parallel  processing  architectures  were  considered  in  developing  an 
implementation  of  the  data  compression  algorithms. 

In  the  fine-grained,  massively  parallel  architecture,  many  simple  processing  elements 
(PEs)  execute  a  common  program  in  unison.  Each  PE  has  local  storage,  as  well  as  access  to 
the  storage  of  PEs  in  a  local  neighborhood.  Each  PE  executes  the  algorithm  on  a  pixel  or 
group  of  pixels.  This  architecture  is  discussed  in  Section  2.2.1. 

The  PE  of  the  pipelined  architecture  transfers  data  in  a  raster-scan  format  through  a 
series  of  processing  stages.  Each  stage  executes  one  or  several  instructions  from  the  algorithm 
on  the  data  that  it  receives.  The  current  stage  then  sends  the  results  to  the  next  stage  for 
execution  of  the  next  set  of  instructions  while  it  processes  the  next  set  of  input  pixels.  The 
pipeline  of  stages  acts  as  a  production  line.  Each  stage  executes  a  portion  of  the  algorithm 
and  passes  the  results  to  the  next  PE.  This  architecture  is  discussed  in  Section  2.2.2. 

2.2.1  Massively  Parallel,  Mesh  Connected  Architectures 

The  first  architecture  that  was  considered  is  the  massively  parallel  architecture.  The  .Applied 
Intelligent  Systems  AISI5000  parallel  processor  is  the  implementation  studied.  It  is  a  single 
instruction  multiple  data  (SIMD)  architecture  consisting  of  an  Nxl  dimensional  array  of 
computational  elements;  the  array  size  can  range  from  128x1  to  1024x1.  The  AISI5000 
array  is  as  wide  as  the  image  because  there  is  one  PE  allocated  for  each  column  in  the  image. 
Enough  memory  is  coupled  directly  to  each  PE  to  hold  the  entire  column  of  the  image  data, 
as  well  as  to  provide  a  large  amount  of  storage  space  for  intermediate  results.  A  single  image 


operation  will  load  instructions  into  the  PEs,  read  rows  of  data  from  the  associated  memory 
to  the  PEs,  perform  the  transformation,  and  then  write  the  resulting  transformed  rows  back 
into  memory.  Rows  cire  read  and  written  sequentially  until  the  entire  image  is  processed. 

The  AISI5000  system  consists  of  three  major  components: 

•  A  general  purpose  68000  host  processor  and  peripherals, 

•  Special  purpose  high  speed  input /output  hardware  for  the  parallel  processor,  and 

•  An  SIMD  array  parallel  processor  and  controller. 

Figure  3,  AISI5000  System,  shows  the  relationship  between  the  host,  input/output  (I/O), 
and  parallel  processor  components.  The  68000  host  includes  memory  and  peripherals;  it  ha^ 
a  conventional  design  and  programming  methodology.  The  I/O  sections  transfer  data  in  and 
out  of  the  memory  associated  with  the  PEs  and  contain  corner- turning  logic.  The  corner- 
turning  logic  serially  buffers  the  data  of  a  complete  image  line,  then  passes  the  entire  row 
into  the  processing  array  memories  in  parallel.  The  data  destined  for  the  parallel  memory 
within  the  PEs  can  come  from  a  number  of  sources,  including  cameras  and  standard  68000 
addressable  random  access  memory  (RAM).  The  data  coming  from  parallel  memory  can  go 
to  a  number  of  destinations,  including  video  monitors  and  the  68000  RAM.  The  parallel 
processing  portion  of  the  AISI5000  consists  of  the  PEs,  their  associated  memory,  and  the 
controller  that  directs  their  function. 

The  linear  array  can  contain  up  to  1024  programmable,  bit-serial  PEs;  each  is  tightly 
coupled  to  its  own  local  single-bit-wide  memory  and  has  connections  to  two  neighboring  PEs 
as  shown  in  Figure  4,  Organization  of  the  AISI5000  Processor  Elements.  Each  PE  has  direct 
access  to  32kbits  of  RAM  within  its  own  column.  The  data  in  all  adjacent  memory  columns 
can  be  accessed  by  a  PE  through  connections  to  its  two  neighboring  PEs.  The  PEs  can 
perform  three  distinct  types  of  functions:  Boolean,  neighborhood,  and  arithmetic.  These 
functions  are  performed  on  single  bits  read  from  the  memory;  the  results  of  the  functions  are 
written  back  to  memory.  A  three-channel  I/O  system  facilitates  movement  of  data  in  and 
out  of  the  parallel  memory.  The  I/O  operates  independently  of,  and  asynchronous!}'  with, 
the  PEs. 


10 


VME  Bus 


Figure  3.  AISI  5000  System 


General  purpose,  programmable,  bit-serial  CPU 


Figure  4.  Organization  of  the  AISI  5000  Processor  Elements 


2.2.2  Pipelined,  Recirculating  Architectures 

The  second  architecture  that  was  analyzed  is  the  pipelined  architecture.  With  a  modu¬ 
lar  faimily  of  programmable,  pipelined  processing  units,  a  real-time  image  processor  can  be 
customized  for  a  specific  algorithm.  For  the  pipelined  architecture,  the  Datacube  and  ITI 
processors  were  considered.  They  can  be  customized  to  a  set  of  algorithms  by  distribut¬ 
ing  the  various  operations  required  for  image  processing  and  analysis  to  separate  Versabus 
Module  Europe  (VME)  boards  that  have  been  designed  to  perform  the  operations  of  image 
algorithms.  Each  board  supports  a  powerful,  flexible  set  of  operations  of  a  particular  class. 
Partitioning  of  operations  to  distinct  hardware  boards  typically  occurs  along  the  following 
boundaries: 

•  Image  Acquisition/Display 

•  Image  Memory 

•  Convolution /Finite  Impulse  Response  (FIR)  Filtering 

•  Histogram/Feature  Extraction 

•  General  Digital  Signal  Processing  (DSP) 

•  Neighborhood  Processing 

These  modules  may  be  arranged  to  support  a  number  of  parallel  architectures,  from  multi¬ 
pass  recirculating  architectures  to  fully  pipelined  architectures.  The  processing  units  are 
designed  to  allow  pixel  data  to  flow  synchronously  through  high-speed  pipelines  at  video 
rates — typically  lOMHz  (10  Million  Hertz  or  10  Million  cycles  per  second).  Multiple  ded¬ 
icated  image  data  buses  are  supported  throughout  a  multi-board  configuration.  For  re¬ 
circulating  implementations,  the  routing  of  these  buses  and  the  interconnection  of  various 
processing  modules  can  be  modified  on-the-fly  through  cross-point  switching  units.  A  sepa¬ 
rate  68000-based  host  processor  provides  overall  control  and  updates  control/configuration 
registers  on  the  image  processing  modules.  A  pipelined  configuration  is  shown  in  Figure  5, 
Typical  Programmable  Pipeline  Module  Image  Processor  Configuration. 

2.3  Data  Compression  Processor  Acceptance  Criteria 

The  first  step  of  the  evaluation  process,  explained  in  Section  1.2,  is  applied  here  for  the  data 
compression  hardware.  Based  on  the  data  compression  algorithms  currently  implemented 
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Host  RAM 


e  5.  Typical  Programmable  Pipeline  Module  Image  Processor  Configuration 


and  considering  real-time  execution  constraints,  the  following  mandatory  and  desirable  hard¬ 
ware  selection  criteria  were  established: 

MANDATORY 

•  700  pixels  per  scan  line 

•  High-resolution  (TOO-f  element)  display  support 

•  Algorithm  functionality 

•  Established  product 

•  Processing  power  above  established  minimum 

•  Size,  weight,  and  power  below  established  maximums 

DESIRED 

•  Algorithm  overhead  (9) 

•  Expandability /Configurability  (7) 

•  Architecture  overhead  (7) 

•  Significcint  number  of  units  in  field  (5) 

•  Quality  of  field  support  (4) 

•  Slow  scan  digital  input  support  (3) 

•  Display  function  support  (3) 

•  Minimum  cost  (2) 

•  MinimjJ  size,  weight,  and  power  (2) 

The  mandatory  items  are  essential  to  the  completion  of  the  real-time  REMIDS  project.  The 
desirable  items  are  given  weights  from  one  to  ten — shown  here  in  parentheses — that  reflect 
the  ease  or  difficulty  of  integrating  the  individual  hardware  and  algorithms  into  a  real-time 
system  within  the  current  development  constraints. 

The  available  algorithm  overhead  is  the  most  important  of  the  desirable  criteria.  This 
criteria  refers  to  the  ability  of  the  family  of  processor  modules  to  accommodate  the  various 
algorithm  operations  called  for  in  the  data  compression  algorithms.  The  sponsor  prefers 
these  operations  to  be  implemented  as  directly  as  possible  with  a  minimum  of  software 
development — reconstruction  of  operations  through  sets  of  primitive  commands — and  with 
as  few  recirculations  of  the  images  as  possible. 


Expandability  and  architecture  overhead  are  also  desirable  criteria.  Easy  expandability 
may  be  critical  if  the  data  compression  algorithms  require  more  processing  power  than  we 
have  estimated.  Small  increments  are  more  desirable  for  economic  reasons,  but  reducing  the 
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additional  prograunming  effort — required  when  expanding  the  processor  size —  should  have 
a  higher  priority.  Architecture  overhead  refers  to  the  processing  power  required  for  opera¬ 
tions  that  do  not  contribute  to  the  processing  of  algorithm  data,  such  as  I/O  management, 
processor  synchronization,  etc.  Architecture  overhead  also  includes  processing  power  that 
is  wasted  when  part  of  the  hardware  is  unused  because  the  hardware  is  not  perfectly  fitted 
to  the  algorithm.  For  insteince,  during  synchronization  of  multiple  parallel  processors,  some 
processors  will  idle — contributing  nothing  to  performance — while  waiting  for  other  proces¬ 
sors  to  complete  their  operations.  The  programmer  may  also  have  difficulty  utilizing  the 
hardware  to  full  capacity  if  the  development  system  is  inadequate. 

The  sponsor  prefers  a  slow-scan  digital  interface.  The  non-standard  scanner  video  data 
rate  requires  that  pixels  be  clocked  into  the  processor  at  sensor-driven  rates.  A  hardware 
module  that  will  accommodate  these  requirements  will  eliminate  the  need  for  designing  and 
building  custom  interface  hardware  and  software. 

Image  display  functions  are  a  key  component  of  the  data  compression  processor.  For  in¬ 
stance,  overlay  features  are  also  highly  desirable  and  will  simplify  the  task  of  superimposing 
flagged  pixels  (mines)  upon  the  original  image.  Thus,  we  prefer  the  board  set  to  include  dig¬ 
ital  image  display  hardware  and  software  support  features.  In  addition,  the  system  requires 
support  for  high-resolution  displays  because  the  image  line  lengths  exceed  standard  video 
formats. 

The  following  section  applies  these  criteria  to  the  individual  architectures  considered. 
2.4  Data  Compression  Processor  Analysis  Results 

Table  1,  Data  Compression  Processor  Analysis,  summarizes  the  AISI,  Datacube,  and  ITI 
processor  qualities  beised  on  the  criteria  described  in  Section  2.3.  Determining  whether  a 
processor  possesses  adequate  processing  power  is  a  difficult  criteria  to  evaluate.  Besides  raw 
peak  processing  rate  (for  example,  the  number  of  3x3  convolutions  per  second),  overhead 
such  as  host  I/O,  module  reprogramming,  bus  contentions,  etc.,  must  also  be  considered  in 
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Table  1 .  Data  Compression  Hardware  Alternative  Assessment 


*  Given  'literal'  translation  of  existing  algorithms 


establishing  accurate  estimates  of  net  throughput.  Optimization  techniques  that  could  be 
applied  in  both  hardware  and  software  can  provide  significant  performance  gains,  but  they 
are  difficult  to  predict. 

A  throughput  analysis  for  the  AISI  and  the  Datacube  systems  is  summarized  in  Ta¬ 
ble  2,  Data  Compression  Throughput  Analysis  (Datacube  Board  Set),  and  Table  3,  Data 
Compression  Throughput  Analysis  (AISI5000)  . 

We  divMed  the  existing  data  compression  algorithms  into  a  set  of  generic  operations 
(histogram,  convolution,  threshold,  etc.)  that  could  be  mapped  to  the  hardware  architec¬ 
tures  being  studied.  The  processing  time  for  each  operation  was  then  estimated  based  on 
manufacturer’s  benchm2U‘ks  for  identical  or  related  operations.  A  preliminary  estimate  of 
processor  performance  is  possible  when  given  the  required  input  frame  rate,  knowing  the 
sequence  of  generic  operations  that  must  be  performed  per  fraune,  and  knowing  their  respec¬ 
tive  execution  times  on  a  given  set  of  hardwcire.  Adequate  time  margins  must  be  available 
to  accommodate  the  overhead  not  considered  in  this  simplified  analysis. 

The  following  sections  discuss  the  evaluation  of  the  Datacube,  .4ISI5000.  and  ITI  hard¬ 
ware  for  their  ability  to  perform  the  data  compression  algorithms  at  real-time  rates. 

2.4.1  Datacube  Max  Video  Boards 

The  second  step  of  the  evaluation  process,  explained  in  Section  1.2,  is  applied  here  to 
the  Datacube  MaxVideo  boards.  The  Datacube  Maxvideo  board  set  is  particularly  well 
matched  to  the  operations  specified  in  the  current  data  compression  algorithms.  Its  two- 
stage  pipelined  processor — each  stage  consisting  of  multiple  processing  boards — should  pro¬ 
vide  more  than  adequate  timing  margins  for  processing  and  overhead  operations.  Figure  6, 
Two-Stage  Pipeline-based  Data  Compression  Processor,  depicts  a  two-stage  pipeline  pro¬ 
cessor  for  implementing  the  data  compression  algorithms.  For  those  module  functions  that 
do  not  support  processing  of  variable  regions  of  interest  or  of  line  lengths  gn'ater  than  512 
pixels  (for  example,  histogram /feature  extraction),  two  passes  over  the  image  frame  will  be 
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Table  2.  Data  Compression  Throughput  Analysis  (Datacube  Board  Set) 


CONDITIONS  : 

Single  Processor  Shared  Over  Three  Channels 

700  Pixel  by  176  Lines  Per  Frame 

Three  Frames  Every  .5  Second  (6  Frames  Per  Second) 

10  MHz  Pixel  Processing  Rate 

Two-stage  Pipeline 

One  Second  Processing  Latency 

Algorithms  Executed  “Literally” 


Algorithm  Operation 

Passes  Through  Frames 

Processing  Time  (mSec) 

STAGE  I 

Frame  Copy 

3 

37 

Histogram/Statistics 

3 

99 

Point  Filter/Threshold 

2 

24.6 

(3x3  Convolution) 
(Threshold) 

Sobel  Magnitude 

6 

74 

(3x3  Convolution) 
(Mag.  Detect) 
(Threshold) 

234.6 

STAGE  11 

Donut  Filter 

3 

37 

(7x7  Convolution) 
Histogram  /  Statist  ics 

3 

78 

Threshold 

11 

136 

(6  Compares) 

(5  Boolean) 

Feature  Extract 

4 

50 

301 

Note;  Processing  overhead  times 
must  be  factored  in  and  will 
decrease  resulting  timing  margins. 

TIMING  MARGINS  : 

Stage  I  500  -  235  =  265  mSec  (53%) 

Stage  II  500  -  301  =  199  inSec  ('10%) 


Tables.  Data  Compression  Tbrougliput  Analysis  (AISI5000) 


CONDITIONS  : 

Single  Processor  Shared  Over  Three  Channels 

700  Pixel  by  176  Lines  Per  Frame 

Three  Frames  Every  .5  Second  (6  Frames  Per  Second) 

Algorithms  Executed  “Literally” 


Algorithm  Operation 

Processing  Time  (mSec) 

Load  Frames 

18 

Histogram /Statistics 

2400 

Point  Filter /Threshold 

11 

(3x3  Convolution) 
(Threshold) 

Sobel  Magnitude 

33.6 

(3x3  Convolution) 
(Mag.  Detect) 
(Threshold) 

Donut  Filter 

90 

(7x7  Convolution) 
Threshold 

8 

(6  Compjires) 

(5  Boolean) 

Feature  Extraction 

61 

_ J 

2621.6 

Note:  Processing  overhead  times 
must  be  factored  in  aind  will 
decre«ise  resulting  timing  margins. 

TIMING  MARGIN  : 

No  Margin 
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required.  The  results  are  then  combined.  However,  it  may  prove  more  efficient  to  perform  the 
histogram  function  using  Datacube’s  Euclid  DSP  module,  rather  than  wait  for  the  Feature 
Max  module  to  cycle  through  a  full  512x512  frame.  Because  this  board  requires  a  512x512 
image  and  the  REMIDS  image  is  one-third  this  size,  336  lines  of  the  image  will  not  contain 
data.  The  Feature  Max  board  must  process  512  lines  per  pass;  this  is  a  restriction  not  well 
suited  to  REMIDS  images  (176  lines  per  frame). 

Considering  that  the  data  compression  algorithms  are  essentially  fully  developed  and  will 
require  only  minor  parameter  adjustments  in  the  future,  the  Datacube  implementation  will 
require  the  least  amount  of  effort  in  implementing  the  algorithms  in  hardware.  Thus,  the 
Datacube  board  set  provides  a  very  attractive  performance/ cost  quotient.  Some  degree  of 
flexibility  is  available  in  the  sequencing  of  operations  and  the  interconnecting  of  processing 
elements,  particularly  if  we  use  the  cross-point  switch  module.  However,  overhead  times 
in  the  algorithm  operations  amd  PE  synchronization  can  be  expected  to  rapidly  increase  as 
conditional  branching  increases. 

2.4.2  AISI  AISI5000 

The  second  step  of  the  analysis  process,  explained  in  Section  1.2,  is  applied  here  to  the 
AISI5000.  A  single  700  PE  AISI  processor — one  PE  per  image  column — is  not  suited  to 
direct  implementation  of  the  WES  data  compression  algorithms  as  currently  specified.  This 
incompatablity  results  primarily  from  the  very  slow  rate  at  which  multiple  image  histograms 
can  be  computed  (0.4  seconds  per  700x176  frame).  The  AISI  performance  on  the  other 
required  algorithm  operations  is  more  than  adequate.  The  AISI  processor  works  well  with 
algorithm  implementations  that  employ  image  morphology  rather  than  histogramming  to 
perform  adaptive  background  normalization.  The  data  compression  algorithm  could  be  re¬ 
worked  to  take  advantage  of  the  operations  suited  to  the  AISI5000  architecture  and  remain 
functionally  equivalent  to  the  existing  algorithm.  If  the  data  compression  algorithms  were 
reworked,  the  AISI5000  could  meet  the  performance  requirements  for  a  real-time  data  com¬ 
pression  architecture. 
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If  evaluating  different  data  compression  algorithms — optimized  for  extended  capabil¬ 
ity,  robustness,  etc. — was  desirable,  the  relatively  constrained  architecture  using  dedicated 
processing  boards  becomes  rtiuch  less  desirable  and  th^  flexibility  of  the  AISI  linear  array 
architecture  would  be  more  attractive.  Given  the  resources  to  remap  existing  algorithms  to 
the  AISI  system  (employing  some  fundamentally  different  thinking  but  achieving  the  same 
results),  one  can  achieve  a  hardware  architecture  with  much  more  flexibility  in  terms  of  pixel 
precision,  conditional  execution,  and  frame  sizes.  This  flexibility  is  particularly  evident  with 
programmable  algorithm  operators,  such  as  nonlinear  filtering,  shape  analysis,  etc.  However, 
the  data  compression  algorithm  analysis  and  selection  procedure  has  already  been  completed. 
The  advantages  of  flexibility  are  outweighed  by  the  ability  of  the  Datacube  board  set  to  map 
almost  directly  to  the  selected  algorithms. 

2.4.3  ITI  Board  Set 

Because  the  ITI  board  set  does  not  support  a  high- resolution  display  format,  it  has  been 
eliminated  from  further  consideration. 
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3.0  CLASSIFICATION  PROCESSOR 


An  evaluation  of  the  classification  processors,  considered  for  implementing  the  algorithms 
shown  in  the  lower  half  of  Figure  1,  Minefield  Detection  System,  is  discussed  here.  The  classi¬ 
fication  processor  accepts  pixels  from  the  data  compression  processor  and  selects  those  pixels 
belonging  to  minefields.  The  classification  processor  performs  high-level  computations  that 
a^e  quailitatively  different  from  the  image  computations  performed  in  the  data  compression 
adgorithm  shown  in  the  upper  half  of  Figure  1,  above  the  classification  block.  Consequently, 
a  more  flexible,  general  purpose  architecture  is  required.  The  following  sections  describe  the 
classification  algorithm  and  processing  load,  the  candidate  classification  architectures  and 
processors,  and  the  processor  evaluation  results. 

3,.l  Classification  Algorithm  Analysis 

This  section  describes  the  Environmental  Research  Institute  of  Michigan’s  (ERIM)  under¬ 
standing  of  the  classification  algorithm  hnd  determines  the  processing  power  required  for  its 
real-time  execution.  The  analysis  is  based  on  an  examination  of  the  source  code  supplied  to 
ERIM  by  WES. 

The  classification  algorithm  consists  of  a  clustering  algorithm  followed  by  a  linearity 
and  density  screening  (LDS)  algorithm.  Pixels  emerging  from  these  algorithms  are  tagged 
with  a  value  corresponding  to  the  likelihood  that  they  belong  to  a  mine  in  a  minefield.  The 
(i,  y)  position  of  the  pixel  and  its  tag  are  used  to  highlight  the  potential  minefields  in  one 
of  the  original  infrared  (IR)  images  of  the  terrain.  Operator  feedback  on  the  accuracy  of  the 
highlighted  objects  can  then  be  used  to  refine  algorithm  parameters  during  a  survey  flight. 

The  first  of  the  two  classification  algorithms,  the  clustering  algorithm,  receives  pixels 
selected  by  the  data  compression  processor  that  are  likely  to  belong  to  mines.  The  pixels  are 
three  radiance  values  associated  with  the  same  (x,y)  position;  these  pixels  are  the  thermal 
IR,  and  the  sum  and  difference  of  the  two  polarized  NIR  images.  The  clustering  algorithm  is 
based  on  the  principal  that  the  pixels  of  mines  will  produce  similar  trios  of  radiance  values 
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and  that  clustering  of  values  with  similar  radiance  trios  will  segregate  mine  pixels  from 
background  pixels. 


To  identify  clusters,  the  three  radiance  values  are  treated  as  orthogonal  vectors  in  a  three- 
dimensional  space  and  thus  specify  a  single  point  in  space  (denoted  IR^).  Clusters  are  simply 
groups  of  all  pixel  location  trios  within  a  specified  distance  of  one  another  as  determined  by 
a  scaled  distance  formula.  The  scaled  distance — a  variation  of  the  conventional  square-root 
sum-of-squcires  distance — is  defined  in  Section  3.1.1.  Clusters  axe  represented  as  the  mean 
location  of  all  the  points  in  the  cluster.  The  clustering  algorithm  analysis  is  described  in 
detail  in  Section  3.1.1. 

A  new  input  point  is  compared  to  all  other  clusters  by  calculating  the  point’s  scaled 
distance  to  zdl  existing  clusters  and  sorting  these  distances  to  find  the  closest  cluster.  If 
the  scaled  distance  to  the  closest  cluster  is  within  defined  limits,  the  algorithm  merges  the 
point  into  the  cluster.  This  causes  the  cluster  position,  which  is  the  average  of  all  the  points 
contained  in  the  cluster,  fo  shift.  Therefore,  it  must  be  repeatedly  compared  to  all  other 
clusters  and  merged  as  long  as  the  new  position  is  less  than  a  threshold  distance  from  another 
cluster.  The  merge  occurs  once  for  99%  of  all  clusters  containing  a  single  point.  If  the  single 
point  cluster  is  not  close  enough  to  another  cluster,  then  it  remains  a  cluster  with  one  point 
until  another  cluster  is  close  enough  to  be  merged  with  it.  When  the  algorithm  merges  a 
pixel  with  a  cluster,  or  merges  two  clusters,  the  new  mean  center  is  calculated  according 
to  Equation  (2)  in  Section  3.1.1.  In  either  C2ise,  the  point’s  cluster  is  passed  to  the  LDS 
algorithm. 

The  second  of  the  two  classification  algorithms,  the  LDS  algorithm,  examines  all  the 
points  from  each  cluster  contained  in  a  single  frame  of  imagery  to  determine  the  likelihood 
that  a  cluster  corresponds  to  a  minefield.  The  (x,y)  coordinates  in  the  original  image  of 
each  of  the  points  contained  in  a  cluster  are  examined  amd  the  points  are  grouped  into 
individual  objects.  Objects  are  groups  of  pixels  that  are  screened  and  received  from  the  data 
compression  algorithm,  are  parts  of  an  identical  cluster,  and  are  sufficiently  close  together 
in  the  terrain  image  to  be  considered  the  same  terrain  object.  A  linear  regression,  applied 
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to  the  center  of  mass  of  these  objects,  determines  if  they  are  sufficiently  linearly  spaced.  In 
addition,  the  objects’  spatial  density  is  obtained  by  dividing  the  bounding  rectangular  area 
of  the  objects  in  a  cluster  by  the  number  of  objects  in  a  cluster.  These  two  tests  provide  a 
numeric  measure  of  how  closely  the  objects  of  a  given  cluster  conform  to  the  known  pattern 
of  minefield  layout.  The  LDS  algorithm  is  described  in  detail  in  Section  3.1.2. 

3.1.1  Clustering  Algorithm  Analysis 

The  clustering  algorithm  is  the  first  of  two  algorithms  that  make  up  the  classification  al¬ 
gorithm.  A  detailed  description  of  the  clustering  algorithm  and  a  determination  of  the 
processing  power  required  is  provided  here.  Several  functions  included  in  the  source  code 
supplied  to  ERIM  for  evaluation  have  been  dropped  from  the  core  algorithm.  The  core  al¬ 
gorithm  consists  of  only  the  necessary  functions  of  the  algorithm.  The  omitted  functions 
were  found  to  be  either  algorithmically  unnecessary  artifacts  of  specific  implementations  or 
debugging/prototype  features  that  would  not  be  executed  during  a  real-time  demonstration 
flight.  Figure  7,  Core  Clustering  Algorithm,  is  an  operation  flow  diagram  of  the  clustering 
algorithm.  The  clustering  algorithm  transmits  output  data  to  the  LDS  process. 

When  first  initialized,  the  clustering  algorithm  will  go  through  a  short  period  of  higher 
activity  during  which  many  clusters  are  formed  and  merged.  During  this  active  period,  the 
cluster  positions  will  shift  and  thus  move  in  IR^  cis  they  are  merged  with  each  other.  This 
active  period  should  stabilize  with  a  set  of  clusters  that  are  representative  of  the  mines  and 
background  in  the  sensor  terrain.  Thereafter,  pixels  would  mainly  (99%)  be  added  to  the 
existing  clusters.  Occcisionally,  some  clusters  may  be  created.  We  assumed  that  cluster 
creation  would  stabilize  since  we  expect  the  terrain  to  be  uniform  in  vegetation  and  artifacts 
during  the  minefield  search  period. 

In  Figure  7,  as  each  point  arrives  from  the  data  compression  processor,  the  data  is 
converted  to  a  cluster  representation.  Since  a  pixel  can  be  thought  of  as  a  cluster  with 
a  single  entry,  vve  can  process  points  and  clusters  with  the  same  algorithms.  The  data 
compression  processor  calculates  the  new  cluster’s  scaled  distance  to  all  existing  clusters 
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Figure  7.  Core  Clustering  Algorithm 


and  finds  the  smallest  distance.  The  99%  probability  is  assigned  to  the  threshold  test  of 
the  minimum  scaled  distance  because  this  path  through  the  algorithm  corresponds  to  the 
expected  behaviour  of  each  newly  created  cluster.  After  the  cluster  has  been  merged,  the 
second  occurence  of  this  test  will  fail  most  of  the  time.  This  calculation  of  cluster  distance 
must  be  done  to  account  for  any  movement  of  the  cluster  that  occurs  when  two  clusters  are 
merged.  No  further  merging  will  occur  (99%  of  the  time)  and  the  next  pixel  in  the  frame 
will  be  considered.  Since  each  new  pixel  will  be  merged  once  (99%  of  the  time),  the  scaled 
distance  equation  must  be  computed  twice  for  each  pixel  in  the  frame. 

The  last  two  blocks  of  the  real-time  clustering  algorithm  in  Figure  7  reduces  the  mass 
(number  of  points)  in  the  clusters.  The  clustering  algorithm  executes  for  an  extended  period 
of  time  on  a  continuous  stream  of  data  from  the  IR  sensor.  If  the  external  thermal  conditions 
change  and  the  cluster  masses  are  not  periodically  reduced,  the  external  changes  could  have 
a  serious  effect  on  the  reliability  of  the  algorithm.  For  instance,  if  the  sun  goes  behind  a 
cloud,  the  IR  radiance  values  that  formerly  indicated  a  mine  could  shift  to  those  of  a  benign 
background  object.  Conversely,  background  objects  could  appear  to  be  mines. 

The  clustering  algorithm  contains  several  equations  that  make  up  the  majority  of  the 
processing  load.  The  evaluation  of  the  scaled  distance  requires  the  most  computation  in 
the  clustering  algorithm  since  it  requires  many  calculations  and  must  be  evaluated  at  least 
twice  for  each  new  pixel.  Throughout  the  algorithm  analysis,  the  following  assumptions  are 
made: 

•  Image  data  pixels  {Si,Di,Ti)  are  8-bit  values. 

•  A  classification  frame  is  700  x  512  pixels  (three  sequential  IR  sensor  frames). 

•  A  maximum  of  5000  points/frame  will  be  received  from  the  data  compression 
processor. 

•  A  maximum  of  50  clusters  will  be  generated. 

•  The  minimum  object  (mine)  size  is  5  data-compressed  pixels. 

•  Pixel  data  is  transferred  in  a  raster-scan  order. 

•  The  frame  rate  is  i  =  0.66Hz. 

The  assumptions  above  are  consistent  with  the  source  code  supplied  to  ERIM  by  WES. 
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We  have  defined  several  variables  in  order  to  describe  the  clustering  2dgorithm  mathe¬ 
matically.  S,  D,  and  T  represent  the  sum  and  difference  of  the  polarized  sensor  intensities 
ajid  the  thermal  sensor  intensity,  while  nj  represents  the  number  of  points  in  the  ith  cluster. 
The  «inalysis  of  the  classification  algorithm’s  processing  load  assumes  that  Si,  D,,  T,,  ^5,, 
^Di,  and  nj  axe  maintained  for  each  cluster  to  minimize  calculation  of  the  scaled 

distance.  5,-,  Di,  and  Ti  are  the  averages  and  Y^Si,  Y^i  sums  of  all 

the  pixels  in  the  ith  cluster.  Equations  (1-3)  represent  a  form  of  the  clustering  and  merging 
equations  that  requires  minimal  calculation. 


The  scaled  distance  is  defined  as: 


f( 


c  \  2 


S,  -  Si 


'.E£  +  E-dJ  HETy  +  ErJ 
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where  ‘cov’  is  a  constant.  WES  defines  the  scaled  distance  as  the  square  root  of  Equation  (1). 
Because  the  scaled  distance  is  only  used  for  comparison  to  other  scaled  distances  or  to  an 
empirical  constant,  the  square  root  is  unneccesary.  It  edso  increases  computation  time. 
Optimal  software  coding  of  the  scaled  distance  equation  will  require  9  additions,  3  divisions, 
and  6  multiplications. 


When  two  clusters  are  merged,  the  new  cluster  averages  (5'  ,  Z?'-,  Ty)  and  cluster  sums 
(  Y  recalculated  from  the  current  cluster  values  according  to: 

EC'.  _  V  D'  V  T' 

(nj+n,)’  >  (n>  +  ni)’  ^  (n;  +  n.) 

where: 

S 

=  (3) 

Optimal  coding  of  these  equations  will  require  4  additions  and  3  divisions  per  merge. 
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The  data  acquisition  rate  allows  1.5  seconds  to  process  5000  points  and  50  clusters. 
Consequently,  the  computational  load  can  be  computed  as: 

/2{9A  +  6M  +  3D)  +  (4A  +  3D)' 


,  ^  ,  ,  X  50  X  5000 

1.5  seconds 

(3.7A  +  2M  +  1.5D)  MFLOPS  «  7.2MFLOPS. 


I 


0 


(4) 


The  7.2  million  floating-point  operations  per  second  (MFLOPS)  rate  is  used  in  Section  3.1.3 
to  determine  the  total  processing  load  for  the  classification  algorithm. 

3.1.2  Linearity  and  Density  Screening  (LDS)  Algorithm  Analysis 


The  LDS  algorithm  is  the  second  of  two  algorithms  that  make  up  the  classification  algorithm. 
A  detailed  description  of  the  LDS  algorithm  and  a  determination  of  the  processing  power 
required  is  provided  here.  The  LDS  algorithm  groups  pixels  of  the  same  cluster  into  physical 
objects  based  on  their  spatial  (x,  y)  locations  within  the  original  IR  images.  All  the  objects 
within  a  cluster  are  subjected  to  a  linear  regression  test  to  determine  the  linearity  of  the 
objects  in  the  original  IR  images.  The  spatial  densities  of  the  objects  of  each  cluster  are  then 
calculated.  The  linearity  and  density  values  together  provide  a  quantitative  test  of  whether 
a  given  cluster  of  objects  is  a  minefield. 

Figure  8,  Linearity  and  Density  Screening  Algorithm,  is  an  operation  flow  diagram  of 
the  LDS  algorithm.  The  LDS  algorithm  accumulates  a  frame  of  up  to  5000  points  that  were 
selected  by  the  data  compression  processor.  It  then  uses  the  points’  cluster  identification  (a 
value  that  uniquely  identifies  the  cluster  that  the  point  is  a  member  of)  and  {x,y)  image 
coordinates  to  group  the  points  into  objects  with  the  same  cluster  identification  and  similar 
location.  The  objects  in  each  cluster  are  then  analyzed  to  determine  their  linearity  with 
respect  to  one  another  and  the  density  of  the  cluster.  Finally,  the  LDS  algorithm  selects 
the  clusters  with  acceptable  linearity  and  density  values  and  ranks  them  according  to  their 
likelihood  of  being  minefields. 

The  processing  load  for  the  LDS  algorithm  can  be  determined  in  a  manner  similar  to  that 
of  the  clustering  algorithm.  Each  section  in  Figure  8  has  a  separate  processing  load  that  must 
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Figure  8.  Linearity  and  Density  Screening  Algorithm 


be  summed  to  obtain  the  total  LDS  processing  load.  Conditional  execution  statements  (if- 
then-else)  are  assigned  a  probability  of  execution  for  each  path.  The  probability  corresponds 
to  the  maximum  computational  load  value  (consistent  with  the  other  assumed  constraints 
of  the  problem)  that  was  itemized  in  Section  3.1.1. 

The  first  column  builds  objects  from  pixels  by  first  collecting  pixels  in  a  horizontal 
direction  that  are  within  a  specified  distance  from  each  other.  If  the  cluster  identification 
changed  then  the  current  object  boundary  has  been  reached  and  a  new  object  is  started. 
When  this  occurs  the  object  is  combined  in  a  similar  manner  in  the  vertical  direction,  using 
previously  formed  objects  that  have  the  same  cluster  identification. 

Both  the  X  and  y  components  of  point  locations,  which  have  the  same  class,  are  summed 
to  obtain  the  x  and  y  arithmetic  means  of  the  object.  The  means  will  be  used  as  the  center 
of  mass  coordinates  of  the  objects  when  linearity  measures  are  performed.  The  minimum 
and  maximum  x  are  maintained  for  the  object  and  are  used  to  vertically  merge  the  current 
object  segment  to  previous  segments  with  the  same  class. 

A  total  of  5  additions  are  required  to  test  if  a  new  point  is  part  of  the  current  object. 
Five  additions  are  required  to  add  a  point  to  an  object  and  this  occurs  with  a  probability  of 
0.5.  To  search  vertically  requires  90  additions;  the  probability  of  a  merge  is  0.5,  assuming  20 
previous  objects  are  searched  and  10  objects  are  in  the  same  cluster  as  the  current  object. 
These  eissumptions  follow  from  the  uniform  distribution  of  the  mines  and  the  minimum 
mine  size  of  five  data  compressed  points.  Merging  object  sections  requires  5  additions  at  a 
probability  of  0.25  while  starting  a  new  object  requires  no  arithmetic  operations  and  has  the 
same  probability.  The  total  number  of  additions  is: 

(5A)  +  .5(5A)  +  .5(90A  +  .5(5.A))  =  53.75  additions. 

This  value  is  a  per  point  average  load  based  on  the  given  data-dependent  execution  proba¬ 
bilities. 
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The  second  column  calculates  linear  regression  summations  for  all  the  objects  in  each 
cluster  and  also  finds  the  minimum  and  maximum  (x,y)  extent  of  the  potential  minefield 
boundajy  for  each  class.  The  variables  Xc  and  yc  represent  the  center  of  mass  coordinates 
of  objects  in  the  image.  The  regression  summations  must  be  computed  for  Xc,  yc,  Xcyc, 
Xc,  Yc,  and  Uj,  where  Uj  is  the  number  of  objects  in  the  ith  class.  Because  the  minimum 
object  size  is  5  pixels,  a  maximum  of  =  1000  objects  could  be  in  a  frame.  With  optimal 
coding,  20  additions,  13  multiplications,  and  2  divisions  are  required  per  object  in  the  frame 
to  calculate  these  summations. 


The  third  column  calculates  the  linear  regression  coefficients  for  the  ith  cluster: 


,  _  ^  XcYc  -Xc'Z^yc 

'  Exc^-XcExc 


and  bo  =  yc  -  biXc, 


where: 


n, 


and 


Yc  = 


Eyc 

n. 


The  length  of  the  minefield  along  its  long  axis — a  complex  conditional  equation  that  depends 
on  the  orientation  of  the  minefield  in  the  image — is  also  calculated.  Assuming  a  random 
orientation  of  the  minefield  implies  a  0.5  probability  for  all  the  conditional  pathways  in 
calculating  the  minefield  front  length.  These  equations  require  an  average  of  13  additions, 
7  multiplications,  and  4  divisions  per  cluster. 


The  total  LDS  algorithm  processing  rate  is: 

5000(53.75A)  +  100Q(20A  +  13M  +  2D)  +  50(13A  +  7M  +  4D) 

1.5  (5) 

%  (0.29A  +  O.OIM  +  O.OD)  MFLOPS  0.3  MFLOPS 

for  continuous  real-time  performance.  This  is  a  fairly  trivial  rate  compared  to  the  clustering 

algorithm,  provided  that  the  classification  processor  does  not  impose  a  severe  overhead  to 

perform  the  necessary  operations. 


3.1.3  Classification  Processing  Load 


A  realistic  determination  of  the  processing  load  for  the  entire  classification  algorithm  is 

provided  here.  The  value  is  the  sum  of  the  arithmetic  loads  determined  in  Sections  3.1.1 

I 
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auld  3.1.2  plus  external  communication  overheads.  The  processing  load  value  must  include 
sufficient  head  room  so  that  unforeseen  complexities  in  the  implementation  of  the  algorithms 
will  not  require  the  selected  classification  processor  to  be  replaced  by  one  with  more  pro¬ 
cessing  power. 

The  clustering  and  IDS  algorithms  require  a  minimum  numeric  processing  load  of  about 
7.5MFLOPS  consisting  of  4.0M  additions,  2.0M  multiplications,  and  1.5M  divisions.  The 
7.5MFLOPS  value  must  realistically  be  doubled  to  allow  for  non-optimal  coding  and  unfore¬ 
seen  problems  in  implementing  the  algorithms  for  real-time  performance.  Doubling  the  re¬ 
quired  load  is  a  rough  rule  of  thumb  in  accordance  with  good  engineering  practice.  Therefore, 
the  required  numeric  processing  rate  for  the  clustering  and  LDS  algorithms,  not  considering 
other  overhead  sources,  increases  to  15MFL0PS. 

Additional  processor  time  will  be  required  for  the  overhead  instructions  of  program  flow 
control,  data  loading,  etc.  Numerous  instruction  traces  for  vonNeumann  processors  put  these 
types  of  overhead  at  less  than  20  percent  of  the  total  instruction  stream  for  this  kind  of  a 
numerically  intensive  algorithm.  In  addition,  these  operations  are  mainly  of  simple  integer 
type  and  execute  much  faster  than  the  more  complex  floating-point  arithmetic  instructions. 
Consequently,  an  additional  load  of  about  3  million  instructions  per  second  (MIPS)  should 
be  added  to  the  total  processor  load. 

Communication  with  the  data  compression  processor  is  an  additional  source  of  processing 
load.  The  communication  involves  receiving  the  selected  pixels  with  their  (x,y)  coordinates 
and  transmitting  back  the  locations  and  rankings  of  individual  mines  after  classification. 
The  input  pixels  and  locations  will  likely  consist  of  one  32-bit  word  containing  the  three 
IR  values  (5;,  Di,  Ti)  and  another  32-bit  word  containing  the  associated  (x,j/)  coordinates 
(16  bits  each).  Therefore,  the  data  transfer  rate  is  40,000  bytes  (5000x8)  in  1.5  seconds 
or  about  26,  700  bytes  per  second.  Sending  the  classification  results  back  for  display  would 
require  the  transfer  of  a  maximum  of  1000  objects,  consisting  of  5  bytes  each  (for  x.  //.  and 
rank),  in  1.5  seconds.  This  is  about  3,300  bytes  per  second.  The  total  is  30,000  bytes  per 
second.  Although  the  communication  rate  appears  small,  specific  architectures  may  have 
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lajge  I/O  overheads.  If  ten  instructions  are  required  per  byte  moved,  then  an  additional 
0.3 MIPS  should  be  added  into  the  classification  algorithm  processing  load. 

In  summary,  the  classification  algorithm  will  require  about  15MFLOPS  and  3-4 MIPS  to 
maintain  real-time  execution  plus  0.3  MIPS  for  data  compression  to  classification  processor 
communication.  Any  additional  overhead  that  a  given  processor’s  architecture  imposes  on 
the  algorithm  will  be  discussed  in  Sections  3.2.1  through  3.2.5. 

3.1.4  Fixed-Point  Precision  Requirements 

An  estimate  of  the  precision  requirements  of  classification  Equations  (1-3)  is  provided  here. 
The  required  precision  is  used  to  determine  the  minimal  precision  (number  of  bits)  that  will 
provide  accurate  classification  results.  The  precision  required  to  evaluate  Equations  (1-3)  is 
estimated  by  considering  the  range  of  the  quantities  they  contain  and  the  operations  with 
which  they  are  used. 

The  following  assumptions  are  used  in  the  analysis: 

•  The  ‘cov’  is  an  8-bit  quantity. 

•  Equation  (1)  is  required  to  have  8- bit  accuracy. 

•  The  maximum  number  of  points  in  the  ith  cluster  is  n^  =  255. 

Section  3.1.1  states  that  Si,  Di,  Ti,  ^Si,  ^.nd  n;  are  required  to  compute 

the  distance  and  merging  equations  for  each  cluster.  The  number  of  points  (n^)  in  the  ith 
cluster  requires  a  minimum  of  8  bits  (Log2255)  of  precision.  ^  D,,  and  ^  T;  each 

have  a  range  from  0  to  255^  (255  points  x  255  intensities)  and  require  16  bits  (Log265025)  of 
precision  each.  5,,  Di,  and  Ti  have  the  same  range  as  S,,  Di,  and  T,  (0-255)  because  they 
each  contain  n  pixels  and  are  each  divided  by  n.  Thus,  their  precisions  will  also  be  8  bits. 
The  precisions  of  the  individual  values  can  be  used  along  with  Equations  (1-3)  to  calculate 
the  maximum  precision  necessary  for  the  clustering  algorithm. 

Equation  (1)  of  Section  3.1.1  can  be  divided  into  several  parts. 
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requires; 

2  Log2  (^{255  +  255)^  «  18  bits  (7) 

of  precision  to  the  left  of  the  decimal  point  for  the  worst-case  values  (cov  =  1,  n;  =  nj  =  255) 
and: 

2  Log2  (^^(1  +  1)^  «  14  bits  (8) 

of  precision  to  the  right  of  the  decimal  point  for  the  worst-case  values  (cov  =  255,  n^  =  n^  = 
1).  The  denominators  in  the  three  terms  of  Equation  (1): 

(  V  f  y 

are  about  2n  times  as  large  as  each  numerator  because  the  denominator  contains  twice  the 
sum  of  n  values  that  have  the  same  range  as  those  in  the  numerator.  The  required  precision 
is: 

Log2((2n)^)  =  18  bits  (10) 

to  the  right  of  the  decimal  point  when  the  worst-case  (n  =  255)  is  used. 

To  compute  the  precision  to  the  left  of  the  decimal  point  for  these  terms,  we  must 
first  consider  the  order  that  the  terms  are  evaluated.  Optimal  coding  for  speed  requires 
the  squaring  to  occur  last  so  that  only  one  squaring  operation  is  performed  for  each  term. 
Previous  to  squaring,  the  numerator  and  denominator  are  evaluated  and  their  quotient  is 
formed.  The  numerator  will  never  be  larger  than  the  range  of  5,,  Di,  and  Ti  (255  with  a 
precision  of  Log2255  =  8  bits)  and  the  denominator  never  larger  than  twice  the  range  of 
and  (2  x  65025  =  130050  with  a  precision  of  Log2l30050  =  ITbits). 
Division  of  the  numerator  by  the  denominator  will  place  the  precision  to  the  right  of  the 
decimal  point.  Consequently,  the  minimal  precision  required  to  the  left  of  the  decimal  point 
is  17  bits. 

Equations  (1-10)  provide  enough  information  for  an  estimation  of  the  overall  fixed-point 
precision  requirements  of  the  clustering  and  merging  algorithms.  Eighteen  bits  are  necessary 
to  the  left  of  the  decimal  point  to  represent  the  maximum  value  of  Factor  (6).  Eighteen 
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bits  are  necessary  to  the  right  of  the  decimal  point  to  retain  sufficient  accuracy  when  the 
fractional  Terms  (9)  are  evaluated.  The  total  requirement  is  36  bits. 

In  conclusion,  32-bit  floating-point  representation  will  best  support  the  accuracy  and 
range  of  values  required  for  the  clustering  and  merging  algorithms.  The  accuracy  required 
at  each  step  in  the  evaluation  of  Equations  (1-3)  never  exceeds  the  24-bit  precision  of  the 
mantissa  of  a  32-bit  floating-point  value.  From  a  resource  point  of  view,  the  36  bits  could  be 
used  more  efficiently — provide  a  larger  range  of  the  same  accuracy  using  the  same  precision — 
if  they  were  used  to  represent  floating-point  variables. 

The  classification  algorithm  has  been  defined  and  its  required  precision  determined;  we 
can  use  this  information  in  the  classification  hardware  performance  analysis  below. 

3.2  Classification  Hardware  Performance  Analysis 

Five  architectures  were  considered  for  the  implementation  of  the  classification  algorithms. 
Two  fine-grained,  single  instruction  multiple  data  (SIMD)  architectures  were  evaluated.  The 
SIMD  architectures  are  made  up  of  many — 1024  in  both  of  these  processors — simple  paral¬ 
lel  PEs,  configured  with  linear  (AISI5000)  and  mesh  (DAP 500)  communication  paths.  The 
analysis  for  the  AISI  AISI5000  and  the  AMT  DAP 500  architectures  are  contained  in  Sec¬ 
tion  3.2.5  and  Section  3.2.1,  respectively.  Three  different  coarse  grained  multiple  instruction 
multiple  data  (MIMD)  architectures  were  evaluated:  one  using  standard  complex  instruc¬ 
tion  set  computer  (CISC)  microprocessors  communicating  on  a  parallel  bus  interconnection 
topology  (68020  microprocessor),  one  using  custom  CISC  microprocessors  communicating 
on  a  hypercube  interconnection  topology  (N-Cube  7),  and  one  using  reduced  instruction  set 
computer  (RISC)  processors  communicating  with  a  flexible  mesh  interconnection  topology 
(T800  Transputer).  The  analyses  for  the  Force  68020,  N-Cube  N-Cube  7,  and  Parsytec  T800 
Transputer  are  contained  in  Section  3.2.4,  Section  3.2.3,  and  Section  3.2.2,  respectively. 
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3.2.1  AMT  DAP  500 


One  of  the  two  implementations  of  an  SIMD,  mesh  connected  architecture — AMT’s  (Ac¬ 
tive  Memory  Technology)  DAP 500  (Distributed  Array  Processor) — is  discussed  here.  The 
DAP500  is  organized  as  a  32x32  mesh  of  single-bit  PEs  and  up  to  iMbit  of  local  memory 
per  element  (see  Figure  9,  DAP  500  System  and  Figure  10,  DAP  500  Processor  Element  Or- 
gemization).  The  system  uses  a  10 MHz  clock  rate.  Most  instructions  can  be  executed  at  this 
rate  to  provide  10^“  Boolean  operations  per  second.  These  features  make  the  DAP  500  well 
suited  for  vector,  matrix,  and  image  operations. 

The  analyses  contained  in  the  fixed-point  and  floating-point  sections  below  use  the  clas¬ 
sification  algorithm  MFLOPS  rate,  which  was  calculated  in  Section  3.1,  and  the  fixed-point 
and  floating-point  processing  rates  of  the  DAP 500,  along  with  some  approximations  for  the 
amount  of  processing  power  consumed  by  overhead  items,  to  determine  how  much  head  room 
the  DAP  500  provides.  If  .the  head  room — defined  as  the  amount  of  extra  processing  power 
available  in  the  basic  system — approaches  100%,  then  the  DAP  500  passes  the  mandatory 
criteria  for  processing  power. 

3. 2. 1.1  Fixed-Point  Performance  Analysis 

We  can  use  the  fixed-point  precision  requirements  to  calculate  the  amount  of  time  the 
DAP  500  will  require  to  execute  the  clustering  and  LDS  algorithms.  First  the  execution 
time  for  processing  Equation  (1)  is  calculated  using  execution  times  published  by  AMT  for 
1024-element  vector  operations  using  32-bit  fixed-point  representation.  The  analysis  is  then 
repeated  using  actual  times  calculated  by  AMT  for  Equation  (1),  assuming  1024-element 
vector  operation,  using  16-bit  fixed-point  representation.  Results  for  the  total  clustering  and 
LDS  algorithm  execution  time  are  extrapolated  from  the  execution  time  of  Equation  (1)  and 
normalized  for  comparison.  By  comparing  these  times  with  the  1.5  second  frame  time,  we 
can  calculate  the  head  room  provided  by  the  DAP500. 
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Figure  9.  DAP500  System 


To  Neighboring  PE’s 


Figure  10.  DAP500  Processing  Element  Organization 


The  DAP  500  is  capable  of  executing: 

•  10?4.  ^2  c  fixed-point  additions  or  subtractions  in  I2fi?ec  or 

•  1024  32-bit  fixed-point  multiplications  or  divisions  in  190/isec. 

When  using  additional  bits  of  precision,  we  assume  a  linear  increase  in  time  for  addition 
and  subtraction  operations  and  a  geometric  increase  of  time  for  multiplication  and  division 
operations.  Consequently,  we  find  that  the  DAP 500  is  capable  of  executing: 

•  1024  36-bit  fixed-point  additions  or  subtractions  in  13.5;isec  or 

•  1024  36-bit  fixed-point  multiplications  or  divisions  in  240.47/;sec. 

The  time  to  calculate  the  clustering  and  merging  equations  of  5000  points  for  50  clusters 
is  then: 

50  X  5000  X  (22^^  +  1.31  ^conds. 

The  IDS  calculations  were  estimated  to  require  0.3MFLOPS;  this  is  5%  of  the  clustering 
and  merging  processing  load.  AMT  estimates  that  data  I/O  consumes  about  8%  of  the 
processing  power  of  the  DAP 500.  Therefore,  we  increase  the  time  by  5%  to  account  for  the 
LDS  calculations: 

1.31  seconds  -f  0.05  x  1.31  =  1.37seconds 

\ 

and  then  increase  it  by  8%  to  account  for  data  I/O: 

1.37  seconds  -f  0.08  x  1.37  =  1.48  seconds. 


The  1.48  seconds  is  very  close  to  the  allotted  processing  time  of  1.5  seconds. 


In  conclusion,  36-bit  fixed-point  processing  will  just  beirely  allow  the  processing  to  be 
done  at  real-time  rates.  The  head  room  is: 


(1.50  -  1.48) 
1.48 


X  100%  =  1% 


This  pecentage  of  head  room  does  not  approach  the  100%  head  room  that  was  mandated  by 
good  engineering  practice. 
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Actual  processing  times  for  the  scaled-distance  equation,  shown  in  Equation  (1)  and 
evaluated  at  AMT,  are  used  to  provide  a  more  realistic  value  cf  the  DAP 500  head  room. 
The  following  additional  aissumptions  were  used: 

•  16-bit  fixed-point  format  is  appropriate  for  the  problem, 

•  All  1024  calculations  would  have  the  same  fixed-point  representation  and  would 
require  identical  normalization,  and 

•  No  processors  would  be  idle  in  any  given  frame. 

The  measured  running  time  was  735^sec  for  1024  evaluations,  which  is  equivalent  to: 

50  X  5000  j 

- — - 735/iseconds  =  0.18  seconds 

1024  ^ 

for  the  250,000  (50  x  5000)  evaluations  required  in  the  worst-case  scenario. 

These  calculations  were  performed  using  16-bit  precision  and  must  be  normalized  with 
respect  to  the  36-bit  fixed-point  processing  times  computed  previously.  Linear  increases 
for  addition  and  subtraction  and  geometric  increases  for  multiplication  and  division  will 
be  used  again.  Section  3.1.1  shows  that  the  distance  equation  consists  of  9  additions  and 
subtractions  and  9  multiplications  and  divisions.  The  time  to  compute  250,000  evaluations 
of  Equation  (1)  is: 

i  +  (I)  =  O  Wseconds. 

The  total  running  time  for  the  clustering  algorithm  requires  a  consideration  of  the  fol¬ 
lowing.  Equation  (1)  must  be  computed  at  least  twice  per  frame.  The  merging  equations 
must  also  be  computed  once  per  frame  for  each  of  the  5000  points,  times  each  of  the  50 
clusters.  Because  Equation  (1)  requires  9  additions,  3  divisions,  and  6  multiplications  and 
Equation  (2)  requires  4  additions  and  3  divisions,  the  approxiniation  of  the  running  time  for 
the  merging  equations  is  less  than  50%  of  the  scaled  distance  running  time: 

I  ^  ^  ^  I  0.66  0.3  seconds. 

V9-|-3-(-6y 

The  total  running  time  for  the  evaluations  of  the  clustering  equation  is: 

0.66  0.66  -h  0.3  =  1.61  seconds. 
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The  LDS  calculations  were  estimated  to  require  0.3MFLOPS;  this  is  5%  of  the  clustering 
and  merging  processing  load.  AMT  estimates  that  data  I/O  consumes  about  8%  of  the 
processing  power  of  the  DAP  500. 

Therefore,  the  time  can  be  increased  by  5%  to  account  for  the  LDS  calculations: 

1.61  seconds  +  0.05  x  1.61  =  1.69  seconds 
and  then  increased  by  8%  to  account  for  data  I/O: 

1.69  seconds  +  0.08  x  1.69  =  1.83  seconds. 

This  time  is  a  reasonable  estimate  of  the  entire  processing  time  per  frame  including  data 
I/O.  Comparing  1.83  seconds  with  the  1.5  seconds  available  for  these  calculations  shows  that 
there  is  no  head  room  available  for  overhead  items  that  haven’t  been  included  in  this  analysis 
or  for  an  increase  in  the  required  processing  power  of  the  classification  algorithm. 

There  are  a  few  factors  that  haven’t  been  included  in  cadculating  the  1.83  second  pro¬ 
cessing  time.  The  classification  calculations  could  be  heind  coded  by  an  experienced  DAP  500 
programmer  to  make  use  of  the  full  power  of  the  system.  The  cilgorithm  complexity  has 
been  estimated,  so  more  processing  power  may  be  required.  There  aire  other  overhead  items 
such  eis  interprocessor  communication  and  synchronization  that  may  also  require  a  signifi¬ 
cant  percentage  of  the  total  processing  power  available.  The  maximum  number  of  clusters 
is  not  related  to  the  processor  mesh  size  (32x32)  and  may  require  that  a  percentage  of  the 
processors  are  unused  for  a  portion  of  each  iteration. 

3. 2. 1.2  Floating-Point  Performance  Analysis 

The  amount  of  time  that  the  DAP  500  requires  to  execute  the  classification  and  LDS  algorithm 
equations,  when  a  floating-point  representation  for  variables  is  used,  is  estimated  here.  This 
time  is  compared  to  the  1.5  second  frame  time  limit  for  real-time  processing  to  determine  if 
the  required  head  room  is  available. 
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The  DAP  500  is  capable  of  executing: 

•  1024  32-bit  floating-point  additions  or  subtractions  in  105/isec;  this  is  9.75MFLOPS.  | 

•  1024  32-bit  floating-point  multiplications  in  156/xsec;  this  is  6.56MFLOPS. 

•  1024  32-bit  floating-point  divisions  in  212/isec;  this  is  4.83MFLOPS. 


If  the  algorithm  exclusively  contained  multiplications  or  divisions,  the  DAP 500  would 
not  be  able  to  provide  7.5MFLOPS  for  the  real-time  classification  algorithm.  Therefore,  a 
closer  examination  of  the  algorithm  is  necessary. 


Referring  to  Equations  (1-3),  we  see  that  there  are  9-t-9-t-4  =  22  additions,  3-t-3-h3  =  9 
divisions,  and  64-6-f  0  =  12  multiplications  required  to  calculate  the  clustering  and  merging 
equations  for  a  single  point  on  one  cluster.  The  time  to  calculate  the  clustering  and  merging 
equations  of  5000  points  for  50  clusters  is: 


50  X  5000 


I05^sec  156^sec  2l2fisec\ 

^  ^  1  /\r*.  A  I 


1024 


1024 


1024 


=  1.53  seconds. 


Tixe  LDS  cailculations  will  also  require  a  relatively  small  amount  of  time  .  nee  they  were 
estimated  to  require  O.SMFLOPS;  this  is  5%  of  the  clustering  and  merging  processing  load. 
Also,  AMT  estimates  that  data  I/O  consumes  about  8%  of  the  processing  power  of  the 
DAP  500. 


Therefore,  the  time  can  be  increased  by  5%  to  account  for  the  LDS  calculations: 

1.53  seconds  -f  0.05  x  1.53  =  1.61  seconds 
and  then  increased  by  8%  to  account  for  data  I/O: 

l.Olseconds  4-0.08  x  1.61  =  1.74seconds. 

This  time  is  a  reasonable  estimate  of  the  entire  processing  time  per  frame  including  data 


I/O. 


Comparing  this  time  with  the  1.5  seconds  available  for  these  calculations  shows  that  there 
is  no  head  room  available  for  overhead  items  that  haven’t  been  included  in  this  analysis  or 
for  an  increase  in  the  required  processing  power  of  the  classification  algorithm. 
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As  in  section  3. 2. 1.1,  there  «ire  a  few  factors  that  haven’t  been  considered.  The  classi- 
hcatioa  calculations  could  be  hand  coded  by  an  experienced  DAP 500  programmer  to  make 
use  of  the  full  power  of  the  system.  The  algorithm  complexity  has  been  estimated,  so  more 
processing  power  may  be  required.  There  are  other  overhead  items  such  as  interprocessor 
communication  and  synchronization  that  may  also  require  a  significant  percentage  of  the 
total  processing  power  available.  The  maximum  number  of  clusters  is  not  related  to  the 
processor  mesh  size  (32x32)  and  may  require  that  a  percentage  of  the  processors  are  unused 
for  a  portion  of  each  iteration. 

Based  on  these  head  room  values  and  those  of  the  fixed-point  performance  analysis, 
another  processor  should  be  considered  for  implementing  the  classification  algorithm. 

3.2.2  Parsytec  T800  Transputer 

An  evaluation  of  a  RISC  processor  communicating  by  means  of  a  flexible  mesh  interconnec¬ 
tion  topolog}'  is  discussed  here.  A  multiple  transputer  system  from  Parsytec  GmbH  can  be 
configured  to  perform  all  the  processing  functions  necessary  in  the  classification  processor. 
Figure  11,  Transputer-bcised  Classification  System,  illustrates  the  system  level  interconnec¬ 
tion  that  would  be  used  to  complete  the  two  separate  algorithms.  The  tree  structure  on  the 
left  implements  the  clustering  algorithm,  while  the  two  processors  on  the  right  alternately 
perform  the  LDS  algorithm  on  a  frame-by-frame  basis.  When  processing  at  real-time  rates, 
pixel  data  enters  the  clustering  tree  and  classified  pixels  flow  out  to  the  LDS  processors.  A 
full  frame  of  pixels  are  accumulated  alternately  by  each  processor. 

Since  the  clustering  algorithm  requires  a  high  processing  rate  to  maintain  real-time 
execution  rates,  it  must  be  shared  by  several  Transputers.  The  binary  tree  structure  chosen 
facilitates  the  required  search  for  the  smallest  scaled  distance  within  the  clustering  algorithm. 
This  structure  also  minimizes  the  time  required  to  broadcast  the  value  of  a  merged  cluster 
or  a  new  pixel  entering  the  algorithm  during  the  algorithm’s  iterative  step. 
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Parallel  VME  Link 


Figure  1 1 .  Transputer-based  Classification  System 


The  LDS  algorithm  requires  a  small  processor  loiu  that  could  be  provided  by  a  single 
treinsputer.  Two  transputers  are  used  because  LDS  processing  changes  character  from  pixel- 
by-pixel  to  frame-by-frame.  A  smooth  transition  from  one  type  of  processing  to  the  other  is 
provided  by  having  one  transputer  process  a  frame  of  pixel  data  while  the  other  completes 
the  calculations  hat  require  a  complete  frame — up  to  5000  pixels — of  data  to  accumulate 
before  processing  can  proceed.  The  results  are  then  pcissed  along  to  the  data  compression 
processor  for  display. 

Central  control  of  the  multi- processor  is  executed  by  the  first  processor  in  the  clustering 
tree.  The  processor  communicates  with  the  data  compression  processor  over  the  industry 
standcird  VME  bus.  The  primary  messages  passed  are  the  initially  selected  pixels  output  by 
the  data  compression  processor  and  the  final  mine  locations  passed  back  from  the  classifica¬ 
tion  processor. 

3. 2. 2.1  T800  Processor  Capabilities 

The  transputer  nodes  shown  in  Figure  11  consist  of  1MByte  of  local  memory,  a  TSOO  trans¬ 
puter  that  includes  four  standard  transputer  links  for  communication,  4kBytes  (4x2^°  Bytes) 
of  fast  RAM,  and  an  Institute  of  Electrical  and  Electronics  Engineers  (IEEE)  floating-point 
processor.  Figure  12,  TSOO  Transputer  Chip  Organization,  illustrates  the  internal  architec¬ 
ture  of  the  TSOO  processor  module  and  the  internal  resource  interconnections.  The  integer 
and  floating-point  units  can  maintain  a  throughput  of  7  MIPS  and  2MFLOPS — using  32-bit 
representation — respectively,  when  data  resides  in  the  on-chip  memory.  In  addition,  the 
TSOO  control  logic  allows  floating-point,  integer,  and  communications  operations  to  be  over¬ 
lapped;  the  TSOO  is  limited  by  access  to  the  internal  memory  and  interdependencies  between 
instruction  operands. 

The  transputer  links  are  a  primary  attraction  of  these  processors  as  they  provide  a  low 
overhead  communication  channel  between  programs  running  on  two  different  processors. 
The  transputer  system  program  is  initially  loaded  over  the  links  on  power-up  from  a.  non¬ 
volatile  memory  on  the  VME  bus.  During  real-time  execution,  all  data  I/O  occurs  over  the 
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Figure  12.  T800  Transputer  Chip  Organization 


links.  The  eight  byte  messages  required  for  the  classification  algorithm  can  be  transmitted 
at  10  /isec  per  message. 

3.2. 2. 2  T800  Classification  Performance 


The  T800  processor  discussion  shows  that  a  network  of  transputers  have  sufficient  power 
to  execute  the  classification  cilgorithms.  Whether  the  algorithm  should  be  implemented  in 
floating-point  or  fixed-point  is  a  significeint  question.  A  floating-point  algorithm  is  much 
simpler  to  code  and  debug  because  fixed-point  range  overflows  cannot  occur.  However, 
floating-point  operations  are  generally  more  complex  and  on  some  machines  much  slower 
than  their  fixed-point  counterparts. 


For  single  precision  IEEE  floating-point,  the  T800  can  execute  single  additions,  multipli¬ 
cations,  and  divisions  in  0.35  /zsec,  0.55  /isec,  and  0.85  /isec,  respectively. 


Based  on  the  ratios  of  operations  contained  in  the  clustering  and  merging  Equations  (4- 
5),  an  aggregate  rate  of: 

8(0.35)  -f  4(0.55)  -h  3(0.85) 
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=  0.503  ^sec  or  «2.0MFLOPS 


is  required,  assuming  32-bit  representation. 


Using  32-bit  fixed-point,  the  transputer  can  perform  single  additions,  multiplications, 
and  divisions  in  0.05/xsec,  1.95^sec,  and  2.0^sec  respectively.  These  times  correspond  to 
a  throughput  rate  of  1.06  MIPS  for  the  given  instruction  mix  or  about  half  «is  fast  as  the 
floating-point  version.  The  floating-point  unit  is  faster  than  the  integer  unit  on  the  T800 
because  it  was  designed  several  years  after  the  integer  unit  and  uses  newer  technology. 

In  addition  to  the  arithmetic  operations',  several  communication  operations  occur  during 
the  calculations  for  each  input  point  of  the  clustering  algorithm.  Input  pixels  are  broadcast 
to  all  the  processors  for  calculation  of  the  scaled  distance.  Then  each  processor  finds  the 
minimum  distance  and  sends  it  toward  the  root  of  the  tree  where  the  root  node  checks  the 
distance  against  a  threshold.  Either  a  merge  command  is  sent  back  or  the  pixel  remains 
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a  discrete  cluster.  In  all,  a  worst-case  value  of  about  100  bytes  must  be  shipped  around 
the  multiple  processors  at  1.25/isec  per  byte,  or  125/xsec  per  input  pixel.  Much  of  the  data 
transfer  time  is  overlapped  with  the  arithmetic  processing,  but  a  20%  reserve  capacity  in 
the  clustering  algorithm  will  be  required  to  ensure  that  real-time  processing  rates  can  be 
maintained  when  communication  overhead  is  included. 

The  preceding  throughput  calculations  show  that  nine  T800s  are  required  for  the  clus¬ 
tering  algorithm  (9  x  2.0MFLOPS  >  1.2(14.4)  MFLOPS)  and  the  required  20%  reserve  for 
internal  communication.  As  already  stated,  two  additional  T800s  would  be  used  for  the  LDS. 
The  Parsytec  transputers  are  most  economically  purchased  in  multiples  of  two,  so  a  total  of 
twelve  T800s  would  be  used,  with  ten  assigned  to  the  clustering  algorithm.  In  addition,  cen¬ 
tral  control  would  be  accomplished  by  interfacing  the  twelfth  T800  transputer  with  the  VME 
bus.  The  system  described  would  have  more  than  the  desired  100%  performance  margin  and 
would  be  economically  priced  due  to  the  inherent  simplicity  of  transputer- based  computers. 

3.2.3  N-Cube  N-Cube7 

An  evaluation  of  a  group  of  CISC  microprocessors  communicating  on  a  hypercube  inter¬ 
connection  topology  is  discussed  here.  The  N-Cube7,  from  N-Cube  Systems,  is  a  MIMD 
processor  organized  as  a  hypercube  connection  of  several  0.5  MFLOPS  CISC  node  proces¬ 
sors,  each  with  128kBytes  of  local  memory  (see  Figure  13,  Complete  N-CubelO  System,  and 
Figure  14,  N-Cube  Processor  Chip  Organization).  Each  node  heis  direct  connections  to  N 
nodes  where  N  represents  the  dimension  of  the  N  Cube  and  can  communicate  to  every  node 
in  the  hypercube  by  using  neighboring  nodes  as  switching  networks.  Any  two  nodes  can 
communicate  using  less  than  N  nodes  for  communications  routing,  not  counting  the  source 
and  destination  nodes,  in  an  A^-dimensional  hypercube. 

A  five-dimensional,  16-node  hypercube  can  perform  the  major  classification  calculations. 
The  node  processors  have  0.5MFLOPS  capablility,  and  the  clustering  algorithm  is  estimated 
to  require  7.16MFLOPS. 
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Figure  14.  N-Cube  Processor  Chip  Organization 


Adding  the  5%  estimated  for  the  LDS  means  that: 

/^7.16  +  . 05(7.16)  \ 

I - 

nodes  processors  axe  required. 

This  calculation  assumes  that  data  is  always  available  for  the  processor  nodes  and  that 
communication  and  I/O  overhead  is  zero.  The  minimal  system  64-node  N-Cube7  has  enough 
processing  power  to  handle  the  distance  and  merging  calculations  as  well  as  handle  I/O 
overhead  calculations  and  provide  head  room  for  expansion.  The  64-node  connection  also 
allows  the  clustering  algorithm  to  be  conveniently  mapped  to  the  hardware  without  exceeding 
the  50  cluster  maximum. 

3.2.4  Force  68020  CPU  Board  Set 

An  evti.luation  of  a  group  of  standard  CISC  microprocessors  communicating  on  a  parallel  bus 
interconnection  topology  is  provided  here.  In  this  study,  the  Force  central  processing  unit 
(CPU)  boards  were  considered  since  they  axe  high  performance  processors  and  Force  is  a 
reliable  board  supplier.  In  addition,  the  Force  boards  contain  specific  dedicated  hardware 
that  speeds  message  passing  in  a  multi-processor  system.  The  system  would  be  structured 
as  shown  in  Figure  15,  68020-based  Classification  System.  The  multiple  processors  commu¬ 
nicate  over  a  VME  bus  with  each  other  and  the  data  compression  processor. 

The  multi-processor  structuring  of  the  classification  algorithm  would  be  similar  to  that 
for  the  transputers  (see  Section  3.2.2,  Parsytec  T800  Transputer),  but  with  the  tree  structure 
implemented  logically  via  message  pcissing  over  the  VME  bus  rather  than  by  hardwired  links. 
Each  processor  could  load  a  program  from  a  central  memory  or  an  on-board  programmable 
read  only  memory  (ROM). 

3. 2. 4.1  Force  Board  Capabilities 

Each  processor  would  have  a  Motorola  68020/68882  processor  with  256kBytes  of  local  mem¬ 
ory  running  at  20  MHz  and  zero  wait-states  to  provide  a  processing  rate  of  2  MIPS  and 
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0.2MFLOPS  per  processor.  Floating-point  and  integer  operations  may  be  overlapped  sub¬ 
ject  to  instruction  dependencies.  Dedicated  message  passing  hardware  allows  a  message, 
whose  maximum  length  is  eight  bytes,  to  be  broadcast  to  multiple  processors  with  minimal 
latency.  Additional  resources  on  the  board  may  be  used  for  debugging  (serial  ports)  or  by 
real-time  control  software  (timers). 

The  message  passing  hardware  can  broaidcast  a  message  to  a  maximum  of  21  different 
CPU  boards  in  parallel.  A  hardware  first-in  first-out  memory  (FIFO)  allows  a  maximum  of 
eight  bytes  to  be  sent  in  less  than  20/isec. 

3.2. 4. 2  Force  Board  Classification  Performance 

Multiple  68020/68882  CPUs  will  be  required  to  perform  the  classification  algorithm.  The 
number  of  CPUs  that  would  be  required  for  a  floating-point  and  fixed-point  implementation 
of  the  classification  algorithm  is  determined  here.  Given  the  much  slower  apparent  floating¬ 
point  performance  of  the  CPUs  (about  10  to  1),  serious  consideration  must  be  given  to  a 
fixed-point  implementation. 

For  single  precision  floating-point,  the  CPU  can  perform  single  additions,  multiplications, 
and  divisions  in  3.5/isec,  4.5/isec,  and  6.25  psec,  respectively.  For  the  combination  of  multi¬ 
plication,  division,  addition,  and  subtraction  operations  in  the  classification  algorithm,  the 
throughput  is  0.23MFLOPS.  Alternatively,  the  CPU  has  32-bit  fixed-point  operation  times 
of  O.lpsec,  2.2/isec,  and  3.9/isec  for  single  additions,  multiplications,  and  divisions.  These 
times  correspond  to  a  throughput  rate  of  0.7MIPS  for  fixed-point  arithmetic  operations.  This 
rate  should  be  reduced  by  30%  or  more  to  account  for  the  additional  range  checking  that 
must  be  performed  in  a  fixed-point  implementation.  Therefore,  for  the  mix  of  operations 
expected  from  the  classification  algorithm,  the  integer  implementation  would  be  about  twice 
as  fast  as  the  floating-point  implementation. 

Communication  among  processors  is  flexible  because  the  Force  CPUs  allow  data  to  be 
broadcaist  to  many  processors  in  one  communication  cycle.  Only  75  bytes  need  to  be  trans- 
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mitted  in  the  68020  implementation.  However,  the  message-passing  mechanism  of  the  Force 
boards  prevents  much  of  the  communication  from  being  overlapped  with  the  arithmetic  com¬ 
putation.  All  of  these  items  add  up  to  a  50%  overhead  to  preserve  real-time  throughput  rates 
with  the  Force  board  set. 

Comparing  these  performance  figures  to  the  clustering  algorithm  processing  requirement 
of  1.5(14.4)MFLOPS  shows  that  94  CPUs  would  be  required  for  a  floating-point  implemen¬ 
tation  of  the  clustering  algorithm  alone.  In  other  words,  the  CPU  is  not  powerful  enough  to 
perform  real-time  calculations  for  even  one  cluster.  An  integer  implementation  would  require 
about  43  CPUs  for  real-time  execution  rates.  In  addition,  three  extra  CPUs  would  be  used 
for  the  LDS  and  central  control  of  the  algorithm. 

3.2.5  AISI  AISI5000 

An  evaluation  of  a  massively  parallel  group  of  fine-grained  SIMD  single-bit  processors  com¬ 
municating  by  means  of  linear  communication  paths  is  provided  here.  The  AISI  is  a  SIMD 
machine  organized  as  a  linear  array  of  up  to  1024  single-bit  processors  and  32kbits  of  lo¬ 
cal  memory  (see  Figure  3,  AISI5000  System,  and  Figure  4,  Organization  of  the  AISI5000 
Processor  Elements).  The  AISI5000  was  designed  with  a  very  large  scale  integration  (VLSI) 
gate  array — about  2000  gates.  All  of  the  AISI5000  processors  operate  in  lockstep  at  a  7  MHz 
clock  rate.  The  organization  of  the  AISI5000  processor  allows  it  to  be  used  not  only  for  the 
image  operations  it  was  primarily  designed  for,  but  also  for  general  purpose  processing  of 
vectors,  matrices,  and  non-standard  problems  such  as  neural  nets.  Additional  details  of  the 
AISI5000  processor  can  be  found  in  Sections  2.2.1  and  2.4.2. 

The  AISI 5000  is  capable  of: 

•  1024  32-bit  integer  additions  in  3.85^sec, 

•  1024  32-bit  integer  multiplications  in  318.75/isec,  and 

•  1024  32-bit  integer  divisions  in  318.75;rsec. 

According  to  AISI,  these  times  could  be  accurately  extrapolated  from  the  8- bit  addition 
and  multiplication  times  by  assuming  a  linear  increase  from  8  to  32  bits  for  addition  and  a 
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geometric  increase  for  multiplication.  The  time  for  division  is  2tssumed  to  be  the  same  as  for 
multiplication.  This  rough  approximation  becomes  more  accurate  when  more  calculations 
2u:e  performed,  as  for  vector  operations,  because  the  a^lditionai  time  for  division  is  hidden  in 
the  time  for  overhead  calculations. 


Using  the  same  line  of  reasoning  as  in  Section  3.2.1,  the  time  to  calculate  the  clustering 
and  merging  equations  with  the  AISI5000  for  5000  points  and  50  clusters  is: 


I  «.vl5.41usec  „319usec  ,^319usec 
50  X  5000  X  I  22 — 7^. —  +  9—^—  +12  ^ 


0 


1024 


1024 


1024 


)= 


1.72  seconds. 


if  36-bit  precision  is  used.  The  LDS  calculations  will  also  require  a  relatively  small  amount 
of  time  since  they  were  estimated  to  require  0.3MFLOPS;  this  is  5%  of  the  clustering  and 
merging  processing  load.  Also,  AISI  estimates  that  data  I/O  consumes  about  5%  of  the 
processing  power  of  the  AISI5000.  Therefore,  we  could  increase  the  time  by  5%  to  allow  for 
the  LDS  calculations: 


1.72  seconds  +  0.05  x  1.72  =  1.81  seconds 
and  then  increase  it  by  5%  to  account  for  data  I/O: 

1.81  seconds  +  0.05  x  1.81  =  1.90  seconds. 

This  is  a  reasonable  estimate  of  the  entire  processing  time  per  frame  including  data  I/O. 

The  AISI5000  cannot  perform  the  clcissification  operations  at  the  required  real-time  rate 
and  should  not  be  considered  for  real-time  execution  of  the  REMIDS  classification  algorithms. 

3.3  Classification  Processor  Acceptance  Criteria 

The  first  step  of  the  evaluation  process,  explained  in  Section  1.2,  is  applied  here  for  the  clas¬ 
sification  hardware.  The  classification  processor  architecture  recommended  by  this  report 
should  be  the  one  most  likely  to  produce  a  real-time  demonstration  within  the  limited  de¬ 
velopment  period  available.  The  architecture  may  not  be  the  highest  performance  processor 
nor  the  most  suitable  architecture  since  these  attributes  may  make  the  processor  difficult  to 
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program,  purchjise  or  interface.  This  section  will  define  a  set  of  criteria  that  allows  objective 
comparison  of  different  «irchitectures  and  consistent  rating  of  subjective  features  such  as  ease 
of  programming. 

The  selected  criteria  are  split  into  mandatory  and  desirable  groups. 

MANDATORY 

•  Bus  system  compatible  with  the  data  compression  hardware, 

•  Established  product, 

•  Adequate  processing  power,  and 

•  Below  size,  weight,  power  and  cost  thresholds. 

DESIRED 

•  Development  support  for  parallelism  (10) 

•  Head  room  (%)  of  basic  system  (9) 

•  Architecture  I/O  overhead  (7) 

•  Expandability  (7) 

•  Units  in  field  (5) 

•  Field  support  (4) 

•  Cost  (2) 

•  Size,  weight,  power  (2) 

In  the  desirable  list,  the  development  support  for  parallelism  is  considered  to  be  the 
most  important  and  is  given  a  weight  of  10.  Many  processors  working  together  can  perform 
a  task  in  real-time  if  enough  of  them  are  used;  the  difficulty  arises  in  programming  them  to 
work  efficiently  in  parallel.  Some  development  environments  support  parallelism  well,  while 
others  simply  provide  the  parallel  capablility  and  leave  management  to  the  programmer.  The 
programmer  may  also  have  difficulty  utilizing  the  hardware  to  full  capacity  if  the  development 
system  is  inadequate. 


Head  room,  defined  as  the  percentage  of  spare  processing  power  available  in  a  baseline 
system,  is  given  the  next  highest  importance.  Although  additional  processing  power  can  be 
added  in  some  cases  by  providing  additional  processors,  this  is  not  intended  to  be  considered 
part  of  the  head  room  value. 

Expandability  and  architecture  overhead  were  next  in  order  of  importance.  Easy  ex¬ 
pandability  may  be  critical  if  the  classification  algorithm  requires  more  processing  power 
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than  has  been  estimated  in  this  report.  Small  increments  «ire  more  desirable  for  economic 
reasons,  but  reducing  the  additional  programming  effort,  required  when  expanding  the  pro¬ 
cessor  size,  should  have  a  higher  priority.  Architecture  overhead  refers  to  the  processing 
power  required  for  operations  that  do  not  contribute  to  the  processing  of  algorithm  data, 
such  as  I/O  m2uiagement,  processor  synchronization,  etc.  Architecture  overhead  also  includes 
processing  power  that  is  wasted  when  part  of  the  hcirdware  is  unused  because  the  hardwzire 
is  not  perfectly  fitted  to  the  algorithm.  For  instance,  during  synchronization  of  multiple 
parallel  processors,  some  processors  will  idle — contributing  nothing  to  performance — while 
waiting  for  other  processors  to  complete  their  operations. 

Next  in  importance  were  the  number  of  units  in  the  field  and  the  available  field  support. 
The  number  of  units  in  the  field  is  used  to  estimate  the  maturity  of  the  product.  Normally, 
initial  products  are  prone  to  difficulties  that  a  more  mature  product  would  have  previously 
eliminated.  In  the  event  of  a  problem  with  purchased  hardware,  the  field  support  provided 
will  determine  how  much  time  is  lost  before  work  can  resume.  While  some  companies  provide 
spare  boeirds  for  the  user  to  stock  on  location  or  provide  24  hour  replacement  service,  others 
may  require  that  the  failing  board  be  shipped  out  for  repaur. 

The  final  items  were  the  size,  weight,  power  consumption  and  cost.  Within  the  maximum 
design  constraints,  these  criteria  were  used  to  decide  between  two  closely  matched  processors. 

3.4  Classification  Processor  Analysis  Results 

Determining  whether  a  processor  possesses  adequate  processing  power  is  a  difficult  criteria 
to  evaluate.  Besides  raw  peak  processing  rate  (for  example,  the  number  of  floating-point  or 
fixed-point  operations  that  can  be  processed  per  second),  overhead  such  as  I/O,  processor 
synchronization  and  contention,  etc.,  must  aJso  be  considered  in  establishing  accurate  esti¬ 
mates  of  net  throughput.  Optimization  techniques  that  could  be  applied  in  both  hardware 
and  software  can  provide  significant  performance  gains,  but  they  are  difficult  to  predict. 
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Each  of  the  potential  classification  processor  architectures  are  evaluated  using  the  crite¬ 
ria  defined  in  Section  3.3.  Table  4,  Decision  Analysis  Worksheet  (DAP500/N-Cube);  Table  5, 
Decision  Analysis  Worksheet  (Parsytec/Force);  and  Table  6,  Decision  Analysis  Worksheet 
(AISI5000),  summarize  the  results  of  this  evaluation  for  all  of  the  previously  described  classi¬ 
fication  architectures.  The  following  sections  discuss  the  evaluation  of  each  of  the  processors 
and  derive  the  results  presented  in  Tables  4-6. 

3.4.1  AMT  DAP  500 

The  second  step  of  the  evaluation  process,  explained  in  Section  1.2.  is  applied  here  to  the 
DAP  500.  The  DAP  500  met  adl  the  mandatory  criteria.  A  VME  card  cage,  which  is  available 
for  this  system,  facilitates  communication  to  the  data  compression  processor.  Size,  weight, 
power  and  cost  for  the  DAP  500  are  all  acceptable.  The  product  has  been  established  as 
being  readily  available  as  well  «is  being  favored  by  the  sponsor. 

An  analysis  of  the  algorithm  and  its  implementation  on  the  DAP  500  showed  the  fol¬ 
lowing.  When  using  fixed-point  arithmetic,  the  DAP  500  just  barely  performed  the  required 
operations  in  1.5  seconds.  The  overhead,  when  using  fixed-point  arithmetic,  was  only  1%; 
this  is  drastically  below  the  required  value  of  100%  overhead.  When  using  floating-point 
arithmetic,  the  DAP  500  took  1.74  seconds  to  perform  the  required  operations;  this  is  16% 
longer  than  required  and  provides  -16%  head  room.  Because  the  24-bit  mantissa  is  smaller 
than  a  36-bit  fixed-point  value,  a  shorter  execution  time  might  be  expected.  The  difference 
is  due  primarily  to  the  resynchronization  overhead,  incurred  once  for  each  floating-point 
normalization  for  any  of  the  1024  processors.  Section  3.1.4  showed  that  36-bit  fixed-point 
calculations  or  32-bit  floating-point  calculations  are  required  for  the  classification  equations. 
Because  the  DAP  500  cannot  perform  either  of  these  calculations  at  the  required  rate,  the 
DAP 500  should  be  removed  from  the  list  of  systems  being  considered  for  the  real-time  im¬ 
plementation  of  the  classification  algorithms. 
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The  weighted  criteria  for  the  DAP  were  given  the  following  scores.  There  are  currently 
25-30  third  generation  DAPs  in  the  field.  Over  the  past  eight  years,  20  first-  and  second- 
generation  machines  have  been  delivered.  Compared  with  the  other  systems,  this  number  of 
units  in  the  field  is  small  and  therefore  a  score  of  6  is  given.  Field  support  is  rated  at  10  out 
of  a  possible  10  points,  because  spares  could  be  made  available  for  immediate  replacement 
for  a  price  that  is  reasonable  when  compared  to  the  cost  of  the  total  system. 

The  score  for  the  cost  of  the  system  is  4  because  more  powerful  systems  can  be  purchased 
for  a  significantly  lower  cost.  The  size,  weight  and  power  consumption  are  relatively  small; 
therefore,  the  DAP  500  is  given  a  high  score  of  8. 

A  score  of  0  is  given  for  the  expandability  criteria.  Currently  no  modularity  is  designed 
into  the  DAP  500  processor.  More  processing  power  can  be  provided  only  by  duplicating  the 
system.  AMT  has  just  begun  manufacturing  a  DAP600  series  processor  with  a  64x64  array. 
The  cost  for  a  complete  system  is  $340,000  -  $350,000.  Although  the  cost  is  twice  the  cost 
of  the  DAP 500,  the  system  has  four  times  the  number  of  processors  and  an  increased  I/O 
rate. 

A  score  of  10  is  given  for  the  development  support  for  parallelism  criteria.  A  large 
library  of  standard  routines  is  available  and  has  been  optimized  in  eissembly  code  for  the 
DAP.  Fortran  Plus  allows  code  to  be  written  that  is  more  readable  because  the  details  of 
the  hardware  are  transparent  to  the  programmer.  Assembly  code  allows  the  hardware  to  be 
programmed  for  optimum  use  of  resources  when  processing  time  is  critical. 

The  architecture  overhead  associated  with  the  DAP  500,  when  compared  to  the  other 
systems,  is  in  the  middle  range.  Data  I/O  overhead  is  estimated  to  consume  about  8%  of 
the  processing  power  of  the  system;  the  DAP 500  received  a  relatively  high  score  of  7  for  this 
criteria. 

The  composite  score  for  the  DAP 500  is  243.  It  is  ranked  in  3rd  place,  behind  the  Parsytec 
Transputer  and  the  N-Cube  Hypercube  systems. 
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3.4.2  Parsytec  T800  Transputer 

The  second  step  of  the  evaluation  process,  explained  in  Section  1.2,  is  applied  here  to  the 
Paxsytec  Transputer.  The  transputer  multi-processor  meets  all  of  the  required  criteria  for 
performing  the  real-time  REMIDS  classification  algorithm.  A  VME  interface  card  allows 
communication  with  the  rest  of  the  system.  The  system  size,  weight  and  cost  axe  within 
acceptable  limits.  Of  the  architectures  considered,  the  transputers  have  a  performance-per- 
cost  ratio  that  is  one  of  the  best.  Parsytec  is  also  a  well  established  company  with  world 
markets  and  has  shipped  well  over  100  transputer  boards  to  customers. 

Given  the  additional  programming  effort  required  and  the  discussions  in  Sections  3.1.4 
and  3. 2. 2.2  regarding  fixed-point  versus  floating-point,  a  floating-point  implementation  of 
the  algorithm  would  be  preferrable  to  a  fixed-point  implementation.  The  floating-point  im¬ 
plementation  requires  11  processors  and  the  fixed-point  requires  four  less,  but  the  additional 
programming  cost  could  easily  exceed  the  $13,000  hardware  savings.  In  addition,  the  im¬ 
plementation  risk  expands  greatly  since  the  fixed-point  implementation  would  be  much  less 
robust  with  respect  to  input  data  values  and  would  possibly  be  less  precise. 

The  composite  score  for  the  Parsytec  Transputer  is  430,  which  ranks  this  architecture  in 
1st  place. 

3.4.3  N-Cube  N-Cube7 

The  second  step  of  the  evaluation  process,  explained  in  Section  1.2,  is  applied  here  to  the 
N-Cube  Hypercube.  The  N-Cube  met  all  the  mandatory  criteria.  An  N-Cube  system  with 
the  minimal  configuration  of  64  nodes  has  more  than  enough  processing  power  for  the  classi¬ 
fication  algorithms.  The  system  is  easily  upgradable  to  twice  the  number  of  nodes  and  with 
the  addition  of  a  card  cage  may  be  upgraded  to  1024  nodes.  The  data  I/O  bandwidth  to 
eind  from  the  classification  processor  can  be  met  with  the  serial  host  interface  and  is  capable 
of  communication  with  the  data  compression  hardware  if  an  additional  board  is  bought  for 
the  data  compression  card  cage. 
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The  desirable  criteria  scores  are  as  follows.  The  N-Cube  processor  is  well  established, 
as  is  the  Hypercube  concept,  and  N-Cube  currently  has  200  units  in  the  field;  this  score  is 
10.  The  field  support  score  is  8.  The  N-Cube  comes  with  a  1  year  warranty.  Field  support 
consists  of  exchanging  a  defective  board  for  a  replacement  within  24  hours.  The  cost  score 
of  4  was  based  on  a  $157,500  price.  The  size,  weight  auid  power  score  of  6  was  based  on  a 
10  ft^,  2501b,  1150  watt  system.  The  expandability  was  determined  to  be  good  and  rated  a 
score  of  9.  Development  support  was  rated  very  good  and  scored  10.  Architecture  overhead 
was  accepted  as  low  and  scored  an  8.  The  head  room  for  a  minimal  configuration  64-node 
system  is  300%  and  ranked  a  score  of  10. 

The  composite  score  for  the  N-Cube  Hypercube  is  411,  which  ranks  this  architecture  in 
2nd  place  behind  the  Pairsytec  Transputer  system. 

3.4.4  Force  68020  Board  Set 

The  second  step  of  the  evaluation  process,  explained  in  Section  1.2,  is  applied  here  to 
the  Force  68020  board  set.  Based  on  the  performance  evailuation  in  Section  3.2.4,  the 
68020/68882,  using  floating-point  arithmetic,  cannot  be  seriously  considered  for  a  real-time 
implementation  of  the  REMIDS  classification  algorithm.  Using  fixed-point  arithmetic,  46 
CPUs  could  be  configured  together,  but  the  Force  board  limit  of  21  CPUs  in  a  broadcast 
operation  would  mcike  coordinating  the  algorithm  execution  difficult  and  would  increase  the 
data  traffic  on  the  VME  bus.  In  addition,  the  programming  effort  to  control  such  a  large 
multi-processor  would  be  prohibitive. 

The  composite  score  for  the  Force  68020  board  set  is  248,  which  ranks  the  architecture 
in  4th  place  behind  the  Transputer,  Hypercube  and  DAP  systems. 

3.4.5  AISI  AISI5000 

The  second  step  of  the  evaluation  process,  explained  in  Section  1.2,  is  applied  here  to  the  AISI 
AISI5000.  The  AISI5000  met  all  the  mandatory  criteria.  The  VME  interface  is  compatible 
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with  a  large  variety  of  hardware  so  interfacing  to  the  data  compression  hardware  is  not  a 
problem.  AISI  is  an  established  company  with  40  machines  in  the  field.  The  size,  weight, 
power  and  cost  thresholds  were  all  met  comfortably. 

Because  Section  3.2. 1.4  showed  that  36-bit  fixed- point  calculations  are  required  for  the 
clctssification  operations  and  the  AISI5000  is  incapable  of  processing  at  real-time  rates  with 
this  precision,  the  AISI5000  should  be  removed  from  the  list  of  systems  being  considered  for 
the  read-time  implementation  of  the  classification  algorithms. 

Although  the  AISI5000  is  expandable  in  increments  of  128  PEs,  the  full  1024  PE  system  is 
used  as  the  baseline.  No  additional  increments  are  possible  with  the  exception  of  duplication 
amd  next  generation  hardware. 

The  development  support  for  parallelism  is  poor.  Although  the  system  does  handle  par- 
cdlelism  well  and  hais  am  extensive  number  of  commands  amd  librairy  routines  for  vectors,  the 
system  works  primarily  on  image  data.  The  8-bit  image  data  commands  are  not  appropriate 
for  the  floating-point  computations  that  must  be  performed. 

The  architecture  overheaid  for  data  I/O  is  about  5%;  this  is  a  middle  range  score  compared 
with  the  other  architectures  considered. 

The  composite  score  for  the  AISI5000  is  145,  which  ranks  the  architecture  in  5th  place. 
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4.0  PROPOSED  HARDWARE  CONFIGURATION 


An  overall  system  configuration  for  the  REMIDS  II  Airborne  processor  is  depicted  in  Fig¬ 
ure  16,  REMIDS  II  Airborne  Processor.  At  a  high  level,  the  two  major  processor  components, 
data  compression  and  classification,  are  arranged  in  a  configuration  with  a  separate  Host  CPU 
acting  as  a  common  system  controller  for  each  bus.  This  allows  each  processor  to  operate 
independently  with  its  bus  and  to  link  with  the  host  when  data  exchanges  are  required. 
With  their  heavier  processing  loads,  the  data  compression  and  classification  processor  buses 
axe  never  directly  coupled,  thus  substantially  avoiding  the  inevitable  contention  delays.  The 
proposed  data  compression  hardware  consists  of  programmable  pipeline  modules  produced 
by  Datacube.  The  proposed  classification  hardware  consists  of  several  T800  Transputers 
that  have  been  incorporated  with  additional  memory  into  processing  nodes  by  Parsytec. 

4.1  Operational  Sequencing 


The  operational  sequencing  of  the  airborne  processor  is  summarized  below.  The  inherent 
parallel  operations  are  described  in  a  sequential  manner  for  clarity. 

•  The  host  CPU  initializes  data  compression  and  classification  processors  and  other 
components  of  the  system. 

•  The  scauiner  provides  the  next  line  of  the  3-channel  image  data  that  is  formatted 
by  the  scanner  interface. 

•  The  data  compression  processor  adds  a  new  scan  line  to  its  input  frame  buffer. 

If  the  full  frame  is  assembled,  processing  begins  for  this  frame.  After  frame 
processing  is  complete,  candidate  target  coordinates  and  pixel  values  are  trans¬ 
ferred  to  the  host  CPU  where  they  axe  placed  into  shared  RAM  for  access  by  the 
classification  processor  (see  Figure  17,  Execution  Sequence  Through  the  Data 
Compression  Processor). 

•  The  classification  processor  accepts  candidate  target  information  from  the  host 
RAM  and  performs  clustering  and  LDS  algorithms.  Pixel  locations  where  mines 
are  detected  axe  compiled  and  transferred  to  shared  host  RAM  for  access  by  the 
data  compression  processor. 

•  Detected  mine  locations  are  accepted  by  the  data  compression  processor.  They 
are  then  used  to  assemble  an  image  that  is  superimposed  upon  the  original 
source  image  from  which  the  detected  mines  were  derived.  The  source  images 
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Figure  16.  REM  IDS  II  Airborne  Processor 
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Figure  1 7.  Execution  Sequence  Through  the  Data  Compression  Processor 


are  buffered  within  the  data  compression  processor  and  are  aligned  with  the 
detected  mine  masks  based  upon  the  overall  processing  latency  through  the  image 
processing  chain. 

•  The  scanner  image  with  designated  mines  is  converted  to  a  red,  green,  blue  (RGB) 
format  and  displayed  on  the  high-resolution  monitor  with  suitable  along-track 
compression.  The  uncompressed  video  frame  is  recorded  on  the  video  cassette 
recorder  (VCR)  <md  accepted  by  the  telemetry  unit  for  transmission  to  the  ground 
observation  station. 

4.2  Open  Issues 

Further  development  is  required  in  several  areas  of  the  operational  system  design. 

The  following  issues  remain  unresolved: 

«  Operator  interaction  during  data  collection  and  aurborne  processing. 

•  Recovery  from  data-dependent  errors  that  occur  within  the  classification  proces¬ 
sor,  such  as: 

•  Too  many  clusters,  and 

•  Too  low  a  threshold. 

•  Deletion  of  old  data  from  classification  clusters. 

•  Detailed  bandwidth  analysis. 

•  Telemetry  and  recorder  interface  overhead  effects. 
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